AIO v2.0

Started by Andres Freund over 1 year ago, 213 messages
#1 Andres Freund
andres@anarazel.de

Hi,

It's been quite a while since the last version of the AIO patchset that I have
posted [2]. Of course parts of the larger project have since gone upstream [1].

A lot of time since the last versions was spent understanding the performance
characteristics of using AIO with WAL and understanding some other odd
performance characteristics I didn't understand. I think I mostly understand
that now and what the design implications for an AIO subsystem are.

The prototype I had been working on unfortunately suffered from a few design
issues that weren't trivial to fix.

The biggest was that each backend could essentially have hard references to
unbounded numbers of "AIO handles" and that these references prevented these
handles from being reused. Because "AIO handles" have to live in shared memory
(so other backends can wait on them, that IO workers can perform them, etc)
that's obviously an issue. There was always a way to just run out of AIO
handles. I went through quite a few iterations of a design for how to resolve
that - I think I finally got there.

Another significant issue was that when I wrote the AIO prototype,
bufmgr.c/smgr.c/md.c only issued IOs in BLCKSZ increments, with the AIO
subsystem merging them into larger IOs. Thomas et al's work around streaming
read make bufmgr.c issue larger IOs - which is good for performance. But it
was surprisingly hard to fit into my older design.

It took me much longer than I had hoped to address these issues in the
prototype. In the end I made progress by rewriting the patchset from scratch
(well, with a bit of copy & paste).

The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.

While making v2 somewhat presentable I unfortunately found a few more design
issues - they're now mostly resolved, I think. But I only resolved the last
one a few hours ago, who knows what a few nights of sleeping on it will
bring. Unfortunately that prevented me from doing some of the polishing that I
had wanted to finish...

Because of the aforementioned move, I currently do not have access to my
workstation. I just have access to my laptop - which has enough thermal issues
to make benchmarks not particularly reliable.

So here are just a few teaser numbers, on a PCIe v4 NVMe SSD. Note however
that this is with the BAS_BULKREAD size increased; with the default 256kB ring
we can only keep one IO in flight at a time (because io_combine_limit merges
reads into larger IOs) - we'll need to do something better there, but that's
yet another separate discussion.

Workload: pg_prewarm('pgbench_accounts') of a scale 5k database, which is
bigger than memory:

                           time (s)
master:                      59.097
aio v2.0, worker:            11.211
aio v2.0, uring *:           19.991
aio v2.0, direct, worker:    09.617
aio v2.0, direct, uring *:   09.802

Workload: SELECT sum(abalance) FROM pgbench_accounts;

                           0 workers  1 worker  2 workers  4 workers
master:                       65.753    33.246     21.095     12.918
aio v2.0, worker:             21.519    12.636     10.450     10.004
aio v2.0, uring *:            31.446    17.745     12.889     10.395
aio v2.0, uring **:           23.497    13.824     10.881     10.589
aio v2.0, direct, worker:     22.377    11.989     09.915     09.772
aio v2.0, direct, uring *:    24.502    12.603     10.058     09.759

*  the reason io_uring is slower is that workers effectively parallelize
   memcpys, at the cost of increased CPU usage
** a simple heuristic to use IOSQE_ASYNC to force some parallelism of memcpys

Workload: checkpointing ~20GB of dirty data, mostly sequential:

                           time (s)
master:                      10.209
aio v2.0, worker:            05.391
aio v2.0, uring:             04.593
aio v2.0, direct, worker:    07.745
aio v2.0, direct, uring:     03.351

To solve the issue with an unbounded number of AIO references, there are a few
changes compared to the prior approach:

1) Only one AIO handle can be "handed out" to a backend at a time, without
being defined. Previously the process of getting an AIO handle wasn't super
lightweight, which made it appealing to cache AIO handles - which was one
part of the problem for running out of AIO handles.

2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
valid operation) to stay around; it's always possible to execute the AIO
operation and then reuse the handle. This provides a forward progress
guarantee, by ensuring that completing AIOs can free up handles (previously
they couldn't be reused until the backend-local reference was released).

3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
take the server down.

4) Obviously some code needs to know the result of an AIO operation and be
able to error out. To allow for that, the issuer of an AIO can provide a
pointer to local memory that'll receive the result of the AIO, including
details about what kind of errors occurred (possible errors are e.g. a read
failing or a buffer's checksum validation failing).

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Besides that, I am planning to introduce "io_method=sync", which will just
execute IO synchronously. Besides being a good capability to have in itself,
it'll also make it more sensible to split off worker mode support into its own
commit(s).

Greetings,

Andres Freund

[1]: bulk relation extension, streaming read
[2]: personal health challenges, family health challenges and now moving from the US West Coast to the East Coast, ...

Attachments:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch (+6/-2)
v2.0-0002-Allow-lwlocks-to-be-unowned.patch (+68/-31)
v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch (+13/-41)
v2.0-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch (+15/-2)
v2.0-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch (+55/-1)
v2.0-0006-aio-Add-liburing-dependency.patch (+176/-1)
v2.0-0007-aio-Basic-subsystem-initialization.patch (+196/-1)
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch (+312/-20)
v2.0-0009-aio-Basic-AIO-implementation.patch (+3104/-11)
v2.0-0010-aio-Implement-smgr-md.c-aio-methods.patch (+434/-3)
v2.0-0011-bufmgr-Implement-AIO-support.patch (+519/-8)
v2.0-0012-bufmgr-Use-aio-for-StartReadBuffers.patch (+182/-103)
v2.0-0013-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (+27/-8)
v2.0-0014-aio-Add-IO-queue-helper.patch (+232/-1)
v2.0-0015-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (+580/-65)
v2.0-0016-very-wip-test_aio-module.patch (+1272/-3)
v2.0-0017-Temporary-Increase-BAS_BULKREAD-size.patch (+5/-2)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#1)
Re: AIO v2.0

On 01/09/2024 09:27, Andres Freund wrote:

The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.

+1 on that approach.

To solve the issue with an unbounded number of AIO references there are few
changes compared to the prior approach:

1) Only one AIO handle can be "handed out" to a backend, without being
defined. Previously the process of getting an AIO handle wasn't super
lightweight, which made it appealing to cache AIO handles - which was one
part of the problem for running out of AIO handles.

2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
valid operation) to stay around, it's always possible to execute the AIO
operation and then reuse the handle. This provides a forward guarantee, by
ensuring that completing AIOs can free up handles (previously they couldn't
be reused until the backend local reference was released).

3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
take the server down.

4) Obviously some code needs to know the result of AIO operation and be able
to error out. To allow for that the issuer of an AIO can provide a pointer
to local memory that'll receive the result of an AIO, including details
about what kind of errors occurred (possible errors are e.g. a read failing
or a buffer's checksum validation failing).

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Yeah, a high-level README would be nice. Without that, it's hard to
follow what "handed out" and "defined" above means for example.

A few quick comments on the patches:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch

+1, this seems ready to be committed right away.

v2.0-0002-Allow-lwlocks-to-be-unowned.patch

With LOCK_DEBUG, LWLock->owner will point to the backend that acquired
the lock, but it doesn't own it anymore. That's reasonable, but maybe
add a boolean to the LWLock to mark whether the lock is currently owned
or not.

The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockRelease(). From the names, you might
think that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.

v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch

+1. The old comment "We don't currently need any ResourceOwner in a
walsender process" was a bit misleading, because the walsender did
create the short-lived "base backup" resource owner, so it's nice to get
that fixed.

v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch

My refactoring around postmaster.c child process handling will conflict
with this [1]. Not in any fundamental way, but can I ask you to review
those patches, please? After those patches, AIO workers should also have
PMChild slots (formerly known as Backend structs).

[1]: /messages/by-id/a102f15f-eac4-4ff2-af02-f9ff209ec66f@iki.fi

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: AIO v2.0

Hi,

On 2024-09-02 13:03:07 +0300, Heikki Linnakangas wrote:

On 01/09/2024 09:27, Andres Freund wrote:

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Yeah, a high-level README would be nice. Without that, it's hard to follow
what "handed out" and "defined" above means for example.

Yea - I had actually written a bunch of that before, but then redesigns just
obsoleted most of it :(

FWIW, "handed out" is an IO handle acquired by code, which doesn't yet have an
operation associated with it. Once "defined", the handle's operation could be -
but hasn't yet been - executed.

A few quick comments the patches:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch

+1, this seems ready to be committed right away.

Cool

v2.0-0002-Allow-lwlocks-to-be-unowned.patch

With LOCK_DEBUG, LWLock->owner will point to the backend that acquired the
lock, but it doesn't own it anymore. That's reasonable, but maybe add a
boolean to the LWLock to mark whether the lock is currently owned or not.

Hm, not sure it's worth doing that...

The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockrelease(). From the names, you might think
that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.

Yea, that makes sense.

v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch

My refactoring around postmaster.c child process handling will conflict with
this [1]. Not in any fundamental way, but can I ask you to review those
patch, please? After those patches, AIO workers should also have PMChild
slots (formerly known as Backend structs).

I'll try to do that soonish!

Greetings,

Andres Freund

#4 陈宗志
baotiao@gmail.com
In reply to: Andres Freund (#3)
Re: AIO v2.0

I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
For example, which paths were modified in the AIO module? Is it the
path for writing WAL logs, or the path for flushing pages, etc.?

Also, I recommend keeping this patch as small as possible.
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.

#5 David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#1)
Re: AIO v2.0

On Sun, 1 Sept 2024 at 18:28, Andres Freund <andres@anarazel.de> wrote:

0 workers 1 worker 2 workers 4 workers
master: 65.753 33.246 21.095 12.918
aio v2.0, worker: 21.519 12.636 10.450 10.004
aio v2.0, uring*: 31.446 17.745 12.889 10.395
aio v2.0, uring** 23.497 13.824 10.881 10.589
aio v2.0, direct, worker: 22.377 11.989 09.915 09.772
aio v2.0, direct, uring*: 24.502 12.603 10.058 09.759

I took this for a test drive on an AMD 3990x machine with a 1TB
Samsung 980 Pro SSD on PCIe 4. I only tried io_method = io_uring, but
I did try with and without direct IO.

This machine has 64GB RAM and I was using ClickBench Q2 [1], which is
"SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;"
(for some reason they use 0-based query IDs). This table is 64GBs
without indexes.

I'm seeing direct IO slower than buffered IO with smaller worker
counts. That's counter to what I would have expected as I'd have
expected the memcpys from the kernel space to be quite an overhead in
the buffered IO case. With larger worker counts the bottleneck is
certainly disk. The part that surprised me was that the bottleneck is
reached more quickly with buffered IO. I was seeing iotop going up to
5.54GB/s at higher worker counts.

times in milliseconds
workers   buffered    direct    cmp
0            58880    102852    57%
1            33622     53538    63%
2            24573     40436    61%
4            18557     27359    68%
8            14844     17330    86%
16           12491     12754    98%
32           11802     11956    99%
64           11895     11941   100%

Is there some other information I can provide to help this make sense?
(Or maybe it does already to you.)

David

[1]: https://github.com/ClickHouse/ClickBench/blob/main/postgresql-tuned/queries.sql

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: AIO v2.0

Hi,

Attached is the next version of the patchset. Changes:

- added a "sync" io method; the main benefit is that the main AIO commit
doesn't need to include worker mode

- split worker and io_uring methods into their own commits

- added src/backend/storage/aio/README.md, explaining design constraints and
the resulting design on a high level

- renamed LWLockReleaseOwnership as suggested by Heikki

- a bunch of small cleanups and improvements

There's plenty more to do, but I thought this would be a useful checkpoint.

Greetings,

Andres Freund

Attachments:

v2.1-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch (+6/-2)
v2.1-0002-Allow-lwlocks-to-be-unowned.patch (+82/-31)
v2.1-0003-Use-aux-process-resource-owner-in-walsender.patch (+13/-41)
v2.1-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch (+15/-2)
v2.1-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch (+55/-1)
v2.1-0006-aio-Basic-subsystem-initialization.patch (+192/-1)
v2.1-0007-aio-Core-AIO-implementation.patch (+2332/-2)
v2.1-0008-aio-Skeleton-IO-worker-infrastructure.patch (+311/-19)
v2.1-0009-aio-Add-worker-method.patch (+428/-10)
v2.1-0010-aio-Add-liburing-dependency.patch (+176/-1)
v2.1-0011-aio-Add-io_uring-method.patch (+398/-1)
v2.1-0012-aio-Add-README.md-explaining-higher-level-desig.patch (+311/-1)
v2.1-0013-aio-Implement-smgr-md.c-aio-methods.patch (+434/-3)
v2.1-0014-bufmgr-Implement-AIO-support.patch (+520/-8)
v2.1-0015-bufmgr-Use-aio-for-StartReadBuffers.patch (+182/-103)
v2.1-0016-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (+27/-8)
v2.1-0017-aio-Add-IO-queue-helper.patch (+232/-1)
v2.1-0018-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (+580/-65)
v2.1-0019-very-wip-test_aio-module.patch (+1290/-3)
v2.1-0020-Temporary-Increase-BAS_BULKREAD-size.patch (+5/-2)
#7 Andres Freund
andres@anarazel.de
In reply to: 陈宗志 (#4)
Re: AIO v2.0

Hi,

On 2024-09-05 01:37:34 +0800, 陈宗志 wrote:

I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.

Yep, that was already on my todo list. The version I just posted includes
that.

For example, which paths were modified in the AIO module?
Is it the path for writing WAL logs, or the path for flushing pages, etc.?

I don't think it's good to document this in a design document - that's just
bound to get out of date.

For now the patchset causes AIO to be used for

1) all users of read_stream.h, e.g. sequential scans

2) bgwriter / checkpointer, mainly to have a way to exercise the write path. As
mentioned in my email upthread, the code for that is in a somewhat rough
shape as Thomas Munro is working on a more general abstraction for some of
this.

The earlier patchset added a lot more AIO uses because I needed to know all
the design constraints. It e.g. added AIO use in WAL. While that allowed me to
learn a lot, it's not something that makes sense to continue working on for
now, as it requires a lot of work that's independent of AIO. Thus I am
focusing on the above users for now.

Also, I recommend keeping this patch as small as possible.

Yep. That's my goal (as mentioned upthread).

For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.

Currently the patchset doesn't contain libaio support and I am not planning to
work on using libaio. Nor do I think it makes sense for anybody else to do so
- libaio doesn't work for buffered IO, making it imo not particularly useful
for us.

The io_uring specific code isn't particularly complex / large compared to the
main AIO infrastructure.

Greetings,

Andres Freund

#8 Robert Pang
robertpang@google.com
In reply to: Andres Freund (#7)
Re: AIO v2.0

Hi Andres

Thanks for the AIO patch update. I gave it a try and ran into a PANIC
in bgwriter when executing a benchmark.

2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processes

I debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers the way BufferSync() does for checkpoints.
Here is a small patch to fix it.

Best regards
Robert

Attachments:

0001-Fix-BgBufferSync-to-limit-batch-size-under-io_bounce.patch (+4/-2)
#9 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#6)
Re: AIO v2.0

On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 are Ready for Committer.

+typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md. Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?

+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

+
+	/* IO finished, but result has not yet been processed */
+	AHS_REAPED,
+
+	/* IO completed, shared completion has been called */
+	AHS_COMPLETED_SHARED,
+
+	/* IO completed, local completion has been called */
+	AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+	PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+	Assert(ioh->resowner);
+
+	ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+	ioh->resowner = NULL;
+
+	switch (ioh->state)
+	{
+		case AHS_IDLE:
+			elog(ERROR, "unexpected");
+			break;
+		case AHS_HANDED_OUT:
+			Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+			if (ioh == my_aio->handed_out_io)
+			{
+				my_aio->handed_out_io = NULL;
+				if (!on_error)
+					elog(WARNING, "leaked AIO handle");
+			}
+
+			pgaio_io_reclaim(ioh);
+			break;
+		case AHS_DEFINED:
+		case AHS_PREPARED:
+			/* XXX: Should we warn about this when is_commit? */

Yes.

+			pgaio_submit_staged();
+			break;
+		case AHS_IN_FLIGHT:
+		case AHS_REAPED:
+		case AHS_COMPLETED_SHARED:
+			/* this is expected to happen */
+			break;
+		case AHS_COMPLETED_LOCAL:
+			/* XXX: unclear if this ought to be possible? */
+			pgaio_io_reclaim(ioh);
+			break;
+	}
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup. Even so,
the next point might remove the need here:

+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.

+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi

I used the attached makefile patch to build w/ liburing.

+pgaio_uring_shmem_init(bool first_time)
+{
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+	bool		found;
+
+	aio_uring_contexts = (PgAioUringContext *)
+		ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+	if (found)
+		return;
+
+	for (int contextno = 0; contextno < TotalProcs; contextno++)
+	{
+		PgAioUringContext *context = &aio_uring_contexts[contextno];
+		int			ret;
+
+		/*
+		 * XXX: Probably worth sharing the WQ between the different rings,
+		 * when supported by the kernel. Could also cause additional
+		 * contention, I guess?
+		 */
+#if 0
+		if (!AcquireExternalFD())
+			elog(ERROR, "No external FD available");
+#endif
+		ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);

With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:

2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+	struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+
+	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+	for (int i = 0; i < num_staged_ios; i++)
+	{
+		PgAioHandle *ioh = staged_ios[i];
+		struct io_uring_sqe *sqe;
+
+		sqe = io_uring_get_sqe(uring_instance);
+
+		pgaio_io_prepare_submit(ioh);
+		pgaio_uring_sq_from_io(ioh, sqe);
+	}
+
+	while (true)
+	{
+		int			ret;
+
+		pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+		ret = io_uring_submit(uring_instance);
+		pgstat_report_wait_end();
+
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
+	int			returnCode;
+	Vfd		   *vfdP;
+
+	Assert(FileIsValid(file));
+
+	DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+			   file, VfdCache[file].fileName,
+			   (int64) offset,
+			   iovcnt));
+
+	returnCode = FileAccess(file);
+	if (returnCode < 0)
+		return returnCode;
+
+	vfdP = &VfdCache[file];
+
+	/* FIXME: think about / reimplement  temp_file_limit */
+
+	pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+	return 0;
+}

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them. Otherwise,
deadlocks like this would happen:

backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons

If that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically. Could you add a mention of "deadlock" in the
comment at whichever code achieves that?

I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations. Is there any
specific decision you'd like to settle before patch 6 exits WIP?

Thanks,
nm

Attachments:

uring-makefile-v1.patch (+4/-3)
#10 Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#9)
Re: AIO v2.0

Hi,

Thanks for the review!

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:

On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 are Ready for Committer.

Cool!

+typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md.

Makes sense.

Would you also cover the valid state transitions and which of them any
backend can do vs. which are specific to the defining backend?

Yea, we should. I earlier had something, but because details were still
changing it was hard to keep up2date.

+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.

+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to clean up.

That, or not having submitted the IO. One thing I've been thinking about as
being potentially helpful infrastructure is to have something similar to a
critical section, except that it asserts that one is not allowed to block or
forget submitting staged IOs.

+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.

Makes sense.

+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi

I used the attached makefile patch to build w/ liburing.

Thanks, will incorporate.

With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:

Right - that's to be expected.

2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner - I don't think there's anything to be
gained by supporting io_uring in this situation. It's not like anybody will
use it for real-world workloads where performance matters. Nor would it be
useful for portability testing.

+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different from using uring to,
e.g., wait for incoming network data on thousands of sockets, where you could
have an essentially unbounded number of requests outstanding.

What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?

+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.

For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them.

Yep - it's something I've been fighting with / redesigning a *lot*. Earlier
the AIO subsystem could transparently retry IOs, but that ends up being a
nightmare - or at least I couldn't find a way to not make it a
nightmare. There are two main complexities:

1) What if the IO is being completed in a critical section? We can't reopen
the file in that situation. My initial fix for this was to defer retries,
but that's problematic too:

2) Acquiring an IO needs to be able to guarantee forward progress. Because
there's a limited number of IOs that means we need to be able to complete
IOs while acquiring an IO. So we can't just keep the IO handle around -
which in turn means that we'd need to save the state for retrying
somewhere. Which would require some pre-allocated memory to save that
state.

Thus I think it's actually better if we delegate retries to the callsites. I
was thinking that for partial reads of shared buffers we ought to not set
BM_IO_ERROR though...

Otherwise, deadlocks like this would happen:

backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons

If that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically.

Yea, it's code that I haven't forward ported yet. I think basically
LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
immediately acquire the lock and if the buffer has IO going on.
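The not-yet-forward-ported logic might look roughly like this (pseudocode; pgaio_io_ref_wait() and the buffer IO reference are from the patchset, everything else is guessed):

```
LockBuffer(buf, mode):
    if ConditionalLockBuffer(buf, mode) succeeds:
        return
    if buf has AIO in progress:
        pgaio_io_ref_wait(&buf->io_ref)   /* drives the IO to completion if
                                             nobody else will, breaking the
                                             lock-vs-in-flight-IO deadlock */
    acquire the content lock, blocking, as today
```

The key property is that a backend about to block on a buffer lock first ensures any IO on that buffer can make progress, which is what the IO worker provides for free in worker mode.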

I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations.

Agreed.

Is there any specific decision you'd like to settle before patch 6 exits
WIP?

Patch 6 specifically? That I really mainly kept separate for review - it
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?

In case you mean 6+7 or 6 to ~11, I can think of the following:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

- The header split doesn't quite seem right yet

- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications

- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.

- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.

- I'm not sure that I like name "subject" for the different things AIO is
performed for

- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.

- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.

- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.

- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant
slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.

That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.

Thanks again,

Andres Freund

#11Andres Freund
andres@anarazel.de
In reply to: Robert Pang (#8)
Re: AIO v2.0

Hi,

On 2024-09-12 14:55:49 -0700, Robert Pang wrote:

Hi Andres

Thanks for the AIO patch update. I gave it a try and ran into a PANIC
in bgwriter when executing a benchmark.

2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processes

I debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers like BufferSync() for checkpoint.
Here is a small patch to fix it.

Good catch, thanks!

I am hoping (as described in my email to Noah a few minutes ago) that we can
get away from needing bounce buffers. They are a quite expensive solution to a
problem we made for ourselves...

Greetings,

Andres Freund

#12Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#10)
Re: AIO v2.0

On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:

On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner

Works for me.

+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have essentially unbounded amount of requests outstanding.

What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?

I'd try the following. First, scan for all IOs of all processes at
AHS_DEFINED and later, advancing them to AHS_COMPLETED_SHARED. This might be
unsafe today, but discovering why it's unsafe likely will inform design beyond
EAGAIN returns. I don't specifically know of a way it's unsafe. Do just one
pass of that; there may be newer IOs in progress afterward. If submit still
gets EAGAIN, sleep a bit and retry. Like we do in pgwin32_open_handle(), fail
after a fixed number of iterations. This isn't great if we hold a critical
lock, but it beats the alternative of PANIC on the first EAGAIN.

+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.

Yes. Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that defined
the IO. Still, I think this shouldn't use the term "Start" while no state
name uses that term. What else could remove that mismatch?

Is there any specific decision you'd like to settle before patch 6 exits
WIP?

Patch 6 specifically? That I really mainly kept separate for review - it

No. I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"

doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?

No, I agree a lone commit of 6 isn't a win. Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit. I've not thought through that
grouping much.

In case you mean 6+7 or 6 to ~11, I can think of the following:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.

- The header split doesn't quite seem right yet

I won't have a strong opinion on that one. The aio.c/aio_io.c split did catch
my attention. I made a note to check it again once those files get header
comments.

- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications

Yes, that's a blocker to me.

- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.

Changing that later wouldn't affect much else, so I'd not consider it a
blocker. (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct. I'm not saying worker scaling will push AIO from one side of
that line to another, but that's why I'm fine with commits that omit
self-contained optimizations.)

- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.

Okay.

- I'm not sure that I like name "subject" for the different things AIO is
performed for

How about one of these six terms:

- listener, observer [if you view smgr as an observer of IOs in the sense of https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO, in the object-oriented sense]

- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.

- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.

Agreed. Among the post-patch check-world coverage, which uncovered parts have
the most risk?

- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker. Of these ones I'm calling non-blockers, which would you most
regret deferring?

- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant

Agreed on raising the soft limit. Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.

slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.

That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.

How far do you see the design impact spreading on that one?

Thanks,
nm

#13Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#12)
Re: AIO v2.0

Hi,

On 2024-09-17 11:08:19 -0700, Noah Misch wrote:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.

On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.

Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.

Of course this could be addressed by tuning, but it seems like something that
shouldn't need to be tuned by the majority of folks running postgres.

We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksums, I don't feel like it's a
particularly promising approach.

However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.

Does that seem reasonable?

Greetings,

Andres Freund

#14Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#13)
Re: AIO v2.0

On Mon, 30 Sept 2024 at 16:49, Andres Freund <andres@anarazel.de> wrote:

On 2024-09-17 11:08:19 -0700, Noah Misch wrote:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.

On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.

Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.

FYI, I think you're off by a factor of 8, i.e. that would be 2GB/sec and
666MB/sec respectively, given a normal page size of 8kB and exactly
1ms/3ms full round-trip latency:

1 page/ms per request * 256 in flight = 256 pages/ms; 256 pages/ms *
8kB/page = 2MiB/ms ~= 2GiB/sec. For 3ms, divide by 3 -> ~666MiB/sec.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#15Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#13)
Re: AIO v2.0

On Mon, Sep 30, 2024 at 10:49:17AM -0400, Andres Freund wrote:

We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksums, I don't feel like it's a
particularly promising approach.

However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.

Does that seem reasonable?

Yes.

#16Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#6)
Re: AIO v2.0

On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

Attached is the next version of the patchset. (..)

Hi Andres,

Thank you for your admirable persistence on this. Please do not take the
following as criticism, just as a set of questions regarding the patchset
v2.1 that I finally got a little time to play with:

0. Doesn't v2.1-0011-aio-Add-io_uring-method.patch ->
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check?
Otherwise we'll never know, in theory, that the SQ is full; perhaps at least
such a check should be made with an Assert()? (I understand that right now
we allow just up to io_uring_queue_init(io_max_concurrency), but what
happens if:
a. a previous io_uring_submit() failed for some reason and we don't have
free space in the SQ?
b. (hypothetically) someday someone tries to make PG multithreaded and the
code starts using just one big queue, still without checking
io_uring_get_sqe()?

1. In [0] you wrote that the io_uring mode consumes a high number of FDs
(dangerously close to RLIMIT_NOFILE). I can attest that there are many
customers who use extremely high max_connections (4k-5k, with outliers at
10k in the wild too) - so they won't even start - and I have one doubt about
the user-friendliness impact of this. I'm quite certain it's going to be the
same as with pgbouncer, where one is forced to tweak the OS
(systemd/pam/limits.conf/etc); but PG is better off here because it tries to
preallocate and then close() a lot of FDs, so that's safer at runtime. IMVHO
even if we consume e.g. > 30% of FDs just for io_uring,
max_files_per_process loses its spirit a little and PG is going to start to
lose efficiency too, due to frequent open()/close() calls as the fd cache is
too small. Tomas also complained about this some time ago in [1].

So maybe it would be good to introduce a couple of sanity checks too (even
after setting a higher limit):
- issue a FATAL when io_method = io_uring && max_connections is close to
getrlimit(RLIMIT_NOFILE)
- issue a warning when io_method = io_uring && we wouldn't have even 1k real
FDs free for handling relation FDs (detect something bad like:
getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)

2. In pgaio_uring_postmaster_child_init_local() there's
io_uring_queue_init(32, ...) - why 32? :) And there's also a separate
io_uring_queue_init(io_max_concurrency), which seems to be derived from
AioChooseMaxConcurrency() and can go up to 64?

3. I find having two GUCs with such similar names
(effective_io_concurrency, io_max_concurrency) confusing. From the io_uring
perspective it is clear what io_max_concurrency is all about, but I bet
having effective_io_concurrency in the mix as well is going to be a little
confusing for users (well, it is to me). Maybe that README.md could
elaborate a little on the relation between those two? Or do you plan to
remove io_max_concurrency and bind it to effective_io_concurrency in the
future? To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014,
while the earlier-mentioned AioChooseMaxConcurrency() goes up to just 64.

4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?

5. It appears that pg_stat_io.reads is not refreshed until the query
finishes. While running a query for minutes with this patchset, I got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0

Or is that how it is supposed to work? Also, pg_stat_io.read_time would be
somewhat vague with io_uring/worker, so maybe zero is good here(?).
Otherwise we would have to measure the time spent waiting alone, but that
would cost more instructions for calculating IO times...

6. After playing with some basic measurements - which went fine - I wanted
to test simple PostGIS, even with sequential scans, to look for
compatibility issues (AFAIR Thomas Munro at PGConfEU indicated it as a good
testing point). But before that I tried to see what TOAST performance alone
looks like with AIO+DIO (debug_io_direct=data). One issue I found is that
DIO seems to be unusable until somebody teaches TOAST to use read streams -
is that correct? Maybe I'm doing something wrong, but I haven't seen any
TOAST <-> read streams topic:

-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)

hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.

The difference is that cold caches with DIO see a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data via seqscan: up to 285MB/s
* but TOAST maxes out at 132MB/s when using io_uring+DIO

Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------

7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls being issued, to review whether the
code is making sensible requests? There's no strace for uring - do you stick
to DEBUG3, or is using some bpftrace / xfsslower the best way to go?

8. Not sure if this helps, but I've somehow managed to hit the impossible
situation you describe in pgaio_uring_submit() ("ret != num_staged_ios"),
though I had to push urings really hard into using futexes, and I could have
made some coding error for that to occur too [3]. (I hit the "/* FIXME: fix
ret != submitted ?! seems like bug?! */" code path pretty often with a
6.10.x kernel.)

9. Please let me know the current line of thinking about this patchset: is
it intended to be committed in v18? As a debug feature or a non-debug
feature? (That is, which of the IO methods should be scrutinized the most,
given it is going to be the new default - sync or worker?)

10. At this point, does it even make sense to give pwritev2() with
RWF_ATOMIC an experimental try? (That thing is already in the open, but XFS
is apparently only going to cover it with 6.12.x; I could try with some
-rcX.)

-J.

p.s. I hope I did not ask stupid questions nor miss anything.

[0]: /messages/by-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf - "Right now the io_uring mode has each backend's io_uring instance visible to each other process.(..)"
/messages/by-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf
- "Right now the io_uring mode has each backend's io_uring instance visible
to
each other process.(..)"

[1]: /messages/by-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com - sentence after: "FWIW there's another bottleneck people may not realize (..)"

[2]: /messages/by-id/x3f32prdpgalmiieyialqtn53j5uvb2e4c47nvnjetkipq3zyk@xk7jy7fnua6w

[3]: /messages/by-id/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu+eNZDm6g@mail.gmail.com

#17Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#16)
Re: AIO v2.0

Hi,

Sorry for losing track of your message for this long; I saw it just now
because I was working on posting a new version.

On 2024-11-18 13:19:58 +0100, Jakub Wartak wrote:

On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
Thank you for your admirable persistence on this. Please do not take this as
criticism, just a set of questions regarding patchset v2.1, which I finally
got a little time to play with:

0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?

Yea, it shouldn't ever happen, but it's worth adding a check.
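For illustration, a self-contained toy of the pattern such a check follows (this is deliberately not liburing; the names and the ring layout are made up): like io_uring_get_sqe(), the get-sqe step returns NULL when the submission queue is full, so the caller must test the result before using it rather than silently dereferencing.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy stand-in for an io_uring submission queue.  SQ_ENTRIES and all
 * names here are illustrative only, not from the patchset or liburing.
 */
#define SQ_ENTRIES 8

struct toy_sqe { int opcode; };

struct toy_ring
{
	struct toy_sqe entries[SQ_ENTRIES];
	int			unsubmitted;
};

static struct toy_sqe *
toy_get_sqe(struct toy_ring *ring)
{
	if (ring->unsubmitted >= SQ_ENTRIES)
		return NULL;			/* SQ full - caller must handle this */
	return &ring->entries[ring->unsubmitted++];
}

static int
toy_stage_io(struct toy_ring *ring, int opcode)
{
	struct toy_sqe *sqe = toy_get_sqe(ring);

	/* the check item 0 asks for: don't use the sqe without verifying it */
	if (sqe == NULL)
		return -1;				/* in PG this might be an Assert() or elog(PANIC) */
	sqe->opcode = opcode;
	return 0;
}
```

With SQ_ENTRIES entries already staged, the next toy_stage_io() call fails cleanly instead of writing past the ring.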

Otherwise we'll never know that the SQ is full; perhaps such a check should
at least be made with an Assert()? (I understand that right now we allow
just up to io_uring_queue_init(io_max_concurrency), but what happens if:
if:
a. previous io_uring_submit() failed for some reason and we do not have
free space for SQ?

We'd have PANICed at that failure :)

b. (hypothetical) someday someone will try to make PG multithreaded and the
code starts using just one big queue, still without checking for
io_uring_get_sqe()?

That'd not make sense - you'd still want to use separate rings, to avoid
contention.

1. In [0] you wrote that there's this high amount of FDs consumed for
io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
many customers who are using extremely high max_connections (4k-5k, but
there are outliers with 10k in the wild too) - so they won't even start -
and I have one doubt about the user-friendliness impact of this. I'm quite
certain it's going to be the same as with pgbouncer, where one is forced to
tweak the OS (systemd/pam/limits.conf/etc), but PG is better off here
because it tries to preallocate and then close() a lot of FDs, so that's
safer at runtime. IMVHO, even if we consume, say, > 30% of FDs just for
io_uring, max_files_per_process loses its spirit a little bit and PG is
going to start losing efficiency too, due to frequent open()/close() calls
as the fd cache becomes too small. (Tomas also complained about it some
time ago in [1].)

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

In most distros the soft ulimit is set to something like 1024, but the hard
limit is much higher. The reason for that is that some applications try to
close all fds between 0 and RLIMIT_NOFILE - which takes a long time if
RLIMIT_NOFILE is high. By setting only the soft limit to a low value any
application needing higher limits can just opt into using more FDs.

On several of my machines the hard limit is 1073741816.
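A minimal sketch of that opt-in (the helper name is hypothetical, not from the patchset): raise the soft RLIMIT_NOFILE toward the hard limit, never beyond it, which does not require privileges.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE toward the hard limit, capped at a
 * configured value.  Hypothetical helper for illustration only.
 */
static int
raise_nofile_limit(rlim_t want)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (want > rl.rlim_max)
		want = rl.rlim_max;		/* never exceed the hard limit */
	if (want <= rl.rlim_cur)
		return 0;				/* soft limit already high enough */
	rl.rlim_cur = want;
	/* raising the soft limit up to the hard limit needs no privileges */
	return setrlimit(RLIMIT_NOFILE, &rl);
}
```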

So maybe it would be good to introduce a couple of sanity checks too (even
after setting a higher limit):
- issue a FATAL in case of using io_method = io_uring && max_connections
being close to getrlimit(RLIMIT_NOFILE)
- issue a warning in case of using io_method = io_uring && not having even
a real 1k FDs free for handling relation FDs (detect something bad like:
getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)

Probably still worth adding something like this, even if we were to do what I
am suggesting above.

2. In pgaio_uring_postmaster_child_init_local() there
"io_uring_queue_init(32,...)" - why 32? :) And also there's separate
io_uring_queue_init(io_max_concurrency) which seems to be derived from
AioChooseMaxConccurrency() which can go up to 64?

Yea, that's probably not right.

3. I find having two GUCs named so similarly (effective_io_concurrency,
io_max_concurrency) confusing. It is clear from the io_uring perspective
what io_max_concurrency is all about, but I bet also having
effective_io_concurrency in the mix is going to be a little confusing for
users (well, it is to me). Maybe the README.md could elaborate a little bit
on the relation between those two? Or do you plan to remove
io_max_concurrency and bind it to effective_io_concurrency in the future?

io_max_concurrency is a hard maximum that needs to be set at server start,
because it requires allocating shared memory. Whereas effective_io_concurrency
can be changed on a per-session and per-tablespace
basis. I.e. io_max_concurrency is a hard upper limit for an entire backend,
whereas effective_io_concurrency controls how much one scan (or whatever does
prefetching) can issue.
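To make the distinction concrete, a hedged illustration of how the two knobs might be set (the values are made up; io_max_concurrency is the patchset's restart-only GUC, effective_io_concurrency the long-existing one):

```ini
# postgresql.conf - restart required: sizes the shared-memory AIO handles,
# a hard per-backend upper limit
io_max_concurrency = 64

# settable per session or per tablespace - how much one scan (or whatever
# does prefetching) may keep in flight, always bounded by the above
effective_io_concurrency = 16
```

The second one can also be changed at runtime, e.g. with `SET effective_io_concurrency = 16;` or `ALTER TABLESPACE ... SET (effective_io_concurrency = ...);`.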

To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too, while
the earlier-mentioned AioChooseMaxConccurrency() goes up to just 64

Yea, that should probably be disambiguated.

4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?

I used to have a patch like that in the series and it was a pain to
rebase...

I'm also not sure this would be quite enough to make debug_io_direct
production ready, even if just considering io_direct=data. Without streaming
read use in heap + index VACUUM, RelationCopyStorage() and a few other places,
the performance consequences of using direct IO can be, um, surprising.

5. It appears that pg_stat_io.reads is not refreshed until the query is
finished. While running a query for minutes with this patchset, I've got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0

Or is that how it is supposed to work?

Currently the patch has a FIXME to add some IO statistics (I think I raised
that somewhere in this thread, too). It's not clear to me what IO time ought
to mean. I suspect the least bad answer is what you suggest:

Also pg_stat_io.read_time would be something vague with io_uring/worker, so
maybe zero is good here (?). Otherwise we would have to measure time spent
on waiting alone, but that would force more instructions for calculating io
times...

I.e. we should track the amount of time spent waiting for IOs.

I don't think tracking time in worker or such would make much sense, that'd
often end up with reporting more IO time than a query took.

6. After playing with some basic measurements - which went fine - I wanted
to test simple PostGIS, even with sequential scans, to see any
compatibility issues (AFAIR Thomas Munro indicated it as a good testing
point at PGConfEU), but before that I tried to see what the TOAST
performance alone is with AIO+DIO (debug_io_direct=data).

It's worth noting that with the last posted version you needed to increase
effective_io_concurrency to something very high to see sensible
performance.

That's due to the way read_stream_begin_impl() limited the number of buffers
pinned to effective_io_concurrency * 4 - which, due to io_combine_limit, ends
up allowing only a single IO in flight in case of sequential blocks until
effective_io_concurrency is set to 8 or such. I've adjusted that to some
degree now, but I think that might need a bit more sophistication.
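The arithmetic behind that can be sketched as follows; the helper name is made up, and io_combine_limit = 16 blocks (128kB at 8kB BLCKSZ) is an assumed value:

```c
#include <assert.h>

/*
 * Sketch of the pin-limit arithmetic described above: with the number of
 * pinned buffers capped at effective_io_concurrency * 4, and sequential
 * blocks merged into IOs of up to io_combine_limit blocks, the number of
 * full-sized combined IOs that can be in flight stays at ~1 until
 * effective_io_concurrency reaches 8 or so.  Constants are illustrative.
 */
static int
max_full_ios_in_flight(int effective_io_concurrency, int io_combine_limit)
{
	int			pin_limit = effective_io_concurrency * 4;
	int			ios = pin_limit / io_combine_limit;

	return ios > 1 ? ios : 1;	/* at least one IO is always allowed */
}
```

With io_combine_limit = 16, effective_io_concurrency values of 1 and 4 both allow only a single full IO in flight; only at 8 (32 pinned buffers) do two combined IOs fit.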

One issue I have found is that DIO seems to be unusable until somebody will
teach TOAST to use readstreams, is that correct? Maybe I'm doing something
wrong, but I haven't seen any TOAST <-> readstreams topic:

Hm, I suspect that a read stream won't help a whole lot in many toast
cases. Unless you have particularly long toast datums, the time is going to be
dominated by the random accesses, as each toast datum is looked up in a
non-predictable way.

Generally, using DIO requires tuning shared buffers much more aggressively
than not using DIO; no amount of stream use will change that. Of course we
should try to reduce that "downside"...

I'm not sure if the best way to do prefetching toast chunks would be to rely
on more generalized index->table prefetching support, or to have dedicated code.

-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)

hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.

The difference is that on cold caches DIO gets a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data via seqscan: up to 285MB/s
* but TOAST maxes out at 132MB/s when using io_uring+DIO

I started loading the data to try this out myself :).

Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------

7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls being requested, to review whether the
code is issuing sensible calls? There's no strace for io_uring - do you
stick to DEBUG3, or is using some bpftrace / xfsslower the best way to go?

I think we still want something like it, but I don't think it needs to be in
the initial commits.

There are kernel events that you can track using e.g. perf. Particularly
useful are
io_uring:io_uring_submit_req
io_uring:io_uring_complete

8. Not sure if that helps, but I've somehow managed to hit the impossible
situation you describe in pgaio_uring_submit() "(ret !=
num_staged_ios)", though I had to push urings really hard into using
futexes, and I may have made some coding error for that to occur [3]. As it
stands in the patch from my thread, that path was not covered: /* FIXME:
fix ret != submitted ?! seems like bug?! */ (I hit that code path pretty
often with a 6.10.x kernel.)

I think you can hit that if you don't take care to limit the number of IOs
being submitted at once or if you're not consuming completions. If the
completion queue is full enough the kernel at some point won't allow more IOs
to be submitted.

9. Please let me know what the current up-to-date line of thinking about
this patchset is: is it intended to be committed in v18?

I'd love to get some of it into 18. I don't quite know whether we can make it
happen and to what extent.

As a debug feature or as a non-debug feature? (That is, which of the IO
methods should be scrutinized the most, as it is going to be the new
default - sync or worker?)

I'd say initially worker, with a beta 1 or 2 checklist item to revise it.

10. At this point, does it even make sense to give pwritev2() with
RWF_ATOMIC an experimental try? (That support is already in the open, and
XFS is apparently going to cover it in 6.12.x, but I could try with some -rcX.)

I don't think that's worth doing right now. There's too many dependencies and
it's going to be a while till the kernel support for that is widespread enough
to matter.

There's also the issue that, to my knowledge, outside of cloud environments
there's pretty much no hardware that actually reports power-fail atomicity
sizes bigger than a sector.

p.s. I hope I did not ask stupid questions or miss anything.

You did not!

Greetings,

Andres Freund

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#17)
Re: AIO v2.0

Andres Freund <andres@anarazel.de> writes:

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.

regards, tom lane

#19Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
Re: AIO v2.0

Hi,

On 2024-12-19 17:34:29 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.

Arguably the configuration *did* tell us, by having a higher hard limit...

I'm not saying that we should increase the limit without a bound or without a
configuration option, btw.

As I had mentioned, the problem with relying on increasing the soft limit is
that it's not generally sensible to do so, because it causes a bunch of
binaries to be weirdly slow.

Another reason to not increase the soft rlimit is that doing so can break
programs relying on select().
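The select() hazard comes from FD_SETSIZE (1024 with glibc): a descriptor at or above it cannot be placed in an fd_set, so a globally raised soft limit can hand select()-based programs fds they cannot watch. A minimal check, with a made-up helper name:

```c
#include <assert.h>
#include <sys/select.h>

/*
 * select() can only watch descriptors below FD_SETSIZE; calling FD_SET()
 * with a larger fd writes out of bounds.  This is why opting into a higher
 * rlimit per application is safer than raising the soft limit globally.
 */
static int
fd_usable_with_select(int fd)
{
	return fd >= 0 && fd < FD_SETSIZE;
}
```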

But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?

Greetings,

Andres Freund

#20Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#19)
Re: AIO v2.0

On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:

Arguably the configuration *did* tell us, by having a higher hard limit...
<snip>
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?

Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:

https://0pointer.net/blog/file-descriptor-limits.html

#21Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#23Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
#24Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#22)
#25Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#22)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#22)
#28Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#26)
#29Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#27)
#30Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#25)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
#32Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#28)
#33Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#23)
#34Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#33)
#35Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#32)
#36Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#31)
#37Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#34)
#38Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#37)
#39Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#38)
#40Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#39)
#41Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#36)
#43Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#43)
#45Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#46Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#45)
#47Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#45)
#48Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#47)
#49Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#46)
#50James Hunter
james.hunter.pg@gmail.com
In reply to: Thomas Munro (#47)
#51Robert Haas
robertmhaas@gmail.com
In reply to: James Hunter (#50)
#52Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#47)
#53Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#52)
#54James Hunter
james.hunter.pg@gmail.com
In reply to: Andres Freund (#52)
#55Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#49)
#56Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#53)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#56)
#58Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#59Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#27)
#60Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#58)
#61Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#58)
#62Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#61)
#63Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#62)
#64Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#61)
#65Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#64)
#66Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#63)
#67Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#66)
#68Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#65)
#69Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#68)
#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#69)
#71Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#61)
#72Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#71)
#73Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#72)
#74Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#73)
#75Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#71)
#76Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#75)
#77Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#76)
#78Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#76)
#79Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#77)
#80Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#79)
#81Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#80)
#82Andres Freund
andres@anarazel.de
In reply to: Antonin Houska (#81)
#83Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#82)
#84Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#82)
#85Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#82)
#86Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#84)
#87Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#86)
#88Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#87)
#89Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#86)
#90Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#86)
#91Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#90)
#92Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#86)
#93Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#92)
#94Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#93)
#95Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#86)
#96Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#93)
#97Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#96)
#98Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#94)
#99Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#98)
#100Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#86)
#101Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#102Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#103Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#101)
#104Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#103)
#105Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#102)
#106Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#105)
#107Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#104)
#108Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#109Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#107)
#110Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#108)
#111Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#110)
#112Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#111)
#113Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#112)
#114Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#112)
#115Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#100)
#116Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#106)
#117Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#118Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#114)
#119Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#117)
#120Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#115)
#121Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#118)
#122Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#119)
#123Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#121)
#124Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#120)
#125Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#122)
#126Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#123)
#127Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#125)
#128Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#126)
#129Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#128)
#130Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#129)
#131Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#130)
#132Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#131)
#133Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#132)
#134Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#127)
#135Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#115)
#136Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#134)
#137Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#115)
#138Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#135)
#139Thom Brown
thom@linux.com
In reply to: Andres Freund (#115)
#140Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#138)
#141Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#137)
#142Andres Freund
andres@anarazel.de
In reply to: Thom Brown (#139)
#143Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#141)
#144Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#115)
#145Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#144)
#146Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#145)
#147Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#146)
#148Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#147)
#149Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#147)
#150Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#149)
#151Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#150)
#152Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#151)
#153Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#152)
#154Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#152)
#155Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#154)
#156Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#155)
#157Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#125)
#158Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#143)
#159Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#154)
#160Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#159)
#161Aleksander Alekseev
aleksander@timescale.com
In reply to: Andres Freund (#160)
#162Andres Freund
andres@anarazel.de
In reply to: Aleksander Alekseev (#161)
#163Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#160)
#164Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#163)
#165Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#164)
#166Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#165)
#167Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#164)
#168Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#167)
#169Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#168)
#170Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#169)
#171Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#170)
#172Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#167)
#173Andres Freund
andres@anarazel.de
In reply to: Ranier Vilela (#171)
#174Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#173)
#175Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#172)
#176Andres Freund
andres@anarazel.de
In reply to: Ranier Vilela (#174)
#177Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#175)
#178Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#177)
#179Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#178)
#180Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#179)
#181Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#170)
#182Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#181)
#183Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#182)
#184Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#180)
#185Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#183)
#186Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#185)
#187Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#186)
#188Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#159)
#189Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#86)
#190Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#188)
#191Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#190)
#192Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#187)
#193Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#187)
#194Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#193)
#195Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#194)
#196Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#189)
#197Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#195)
#198Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#197)
#199Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#82)
#200Andres Freund
andres@anarazel.de
In reply to: Antonin Houska (#199)
#201Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#200)
#202Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#201)
#203Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#202)
#204Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#203)
#205Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#196)
#206Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#205)
#207Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#205)
#208Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#206)
#209Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#208)
#210Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#207)
#211Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#203)
#212Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#210)
#213Nathan Bossart
nathandbossart@gmail.com
In reply to: Tomas Vondra (#212)