AIO v2.0

Started by Andres Freund over 1 year ago, 213 messages
#1 Andres Freund
andres@anarazel.de

Hi,

It's been quite a while since the last version of the AIO patchset that I have
posted [2]. Of course parts of the larger project have since gone upstream [1].

A lot of time since the last versions was spent understanding the performance
characteristics of using AIO with WAL and understanding some other odd
performance characteristics I didn't understand. I think I mostly understand
that now and what the design implications for an AIO subsystem are.

The prototype I had been working on unfortunately suffered from a few design
issues that weren't trivial to fix.

The biggest was that each backend could essentially have hard references to
unbounded numbers of "AIO handles" and that these references prevented these
handles from being reused. Because "AIO handles" have to live in shared memory
(so other backends can wait on them, that IO workers can perform them, etc)
that's obviously an issue. There was always a way to just run out of AIO
handles. I went through quite a few iterations of a design for how to resolve
that - I think I finally got there.

Another significant issue was that when I wrote the AIO prototype,
bufmgr.c/smgr.c/md.c only issued IOs in BLCKSZ increments, with the AIO
subsystem merging them into larger IOs. Thomas et al's work around streaming
read make bufmgr.c issue larger IOs - which is good for performance. But it
was surprisingly hard to fit into my older design.

It took me much longer than I had hoped to address these issues in the
prototype. In the end I made progress by rewriting the patchset from scratch
(well, with a bit of copy & paste).

The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.

While making v2 somewhat presentable I unfortunately found a few more design
issues - they're now mostly resolved, I think. But I only resolved the last
one a few hours ago, who knows what a few nights of sleeping on it will
bring. Unfortunately that prevented me from doing some of the polishing that I
had wanted to finish...

Because of the aforementioned move, I currently do not have access to my
workstation. I just have access to my laptop - which has enough thermal issues
to make benchmarks not particularly reliable.

So here are just a few teaser numbers, on a PCIe v4 NVMe SSD. Note however
that this is with the BAS_BULKREAD size increased; with the default 256kB ring
we can only keep one IO in flight at a time (because io_combine_limit merges
reads into larger IOs) - we'll need to do something better there, but that's
yet another separate discussion.

Workload: pg_prewarm('pgbench_accounts') of a scale 5k database, which is
bigger than memory:

                           time (s)
master:                      59.097
aio v2.0, worker:            11.211
aio v2.0, uring *:           19.991
aio v2.0, direct, worker:    09.617
aio v2.0, direct, uring *:   09.802

Workload: SELECT sum(abalance) FROM pgbench_accounts;

                           0 workers  1 worker  2 workers  4 workers
master:                       65.753    33.246     21.095     12.918
aio v2.0, worker:             21.519    12.636     10.450     10.004
aio v2.0, uring *:            31.446    17.745     12.889     10.395
aio v2.0, uring **:           23.497    13.824     10.881     10.589
aio v2.0, direct, worker:     22.377    11.989     09.915     09.772
aio v2.0, direct, uring *:    24.502    12.603     10.058     09.759

*  the reason io_uring is slower is that workers effectively parallelize
   memcpys, at the cost of increased CPU usage
** a simple heuristic to use IOSQE_ASYNC to force some parallelism of memcpys

Workload: checkpointing ~20GB of dirty data, mostly sequential:

                           time (s)
master:                      10.209
aio v2.0, worker:            05.391
aio v2.0, uring:             04.593
aio v2.0, direct, worker:    07.745
aio v2.0, direct, uring:     03.351

To solve the issue with an unbounded number of AIO references, there are a few
changes compared to the prior approach:

1) Only one AIO handle can be "handed out" to a backend at a time, without
being defined. Previously the process of getting an AIO handle wasn't super
lightweight, which made it appealing to cache AIO handles - which was one
part of the problem for running out of AIO handles.

2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
valid operation) to stay around; it's always possible to execute the AIO
operation and then reuse the handle. This provides a forward progress
guarantee, by ensuring that completing AIOs can free up handles (previously
they couldn't be reused until the backend-local reference was released).

3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
take the server down.

4) Obviously some code needs to know the result of an AIO operation and be
able to error out. To allow for that, the issuer of an AIO can provide a
pointer to local memory that'll receive the result of the AIO, including
details about what kind of errors occurred (possible errors are e.g. a read
failing or a buffer's checksum validation failing).

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Besides that, I am planning to introduce "io_method=sync", which will just
execute IO synchronously. Besides being a good capability to have in itself,
it'll also make it more sensible to split off worker mode support into its own
commit(s).

Greetings,

Andres Freund

[1]: bulk relation extension, streaming read
[2]: personal health challenges, family health challenges and now moving from the US West Coast to the East Coast, ...

Attachments:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch (+6/-2)
v2.0-0002-Allow-lwlocks-to-be-unowned.patch (+68/-31)
v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch (+13/-41)
v2.0-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch (+15/-2)
v2.0-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch (+55/-1)
v2.0-0006-aio-Add-liburing-dependency.patch (+176/-1)
v2.0-0007-aio-Basic-subsystem-initialization.patch (+196/-1)
v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch (+312/-20)
v2.0-0009-aio-Basic-AIO-implementation.patch (+3104/-11)
v2.0-0010-aio-Implement-smgr-md.c-aio-methods.patch (+434/-3)
v2.0-0011-bufmgr-Implement-AIO-support.patch (+519/-8)
v2.0-0012-bufmgr-Use-aio-for-StartReadBuffers.patch (+182/-103)
v2.0-0013-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (+27/-8)
v2.0-0014-aio-Add-IO-queue-helper.patch (+232/-1)
v2.0-0015-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (+580/-65)
v2.0-0016-very-wip-test_aio-module.patch (+1272/-3)
v2.0-0017-Temporary-Increase-BAS_BULKREAD-size.patch (+5/-2)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#1)
Re: AIO v2.0

On 01/09/2024 09:27, Andres Freund wrote:

The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.

+1 on that approach.

To solve the issue with an unbounded number of AIO references there are few
changes compared to the prior approach:

1) Only one AIO handle can be "handed out" to a backend, without being
defined. Previously the process of getting an AIO handle wasn't super
lightweight, which made it appealing to cache AIO handles - which was one
part of the problem for running out of AIO handles.

2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
valid operation) to stay around, it's always possible to execute the AIO
operation and then reuse the handle. This provides a forward guarantee, by
ensuring that completing AIOs can free up handles (previously they couldn't
be reused until the backend local reference was released).

3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
take the server down.

4) Obviously some code needs to know the result of AIO operation and be able
to error out. To allow for that the issuer of an AIO can provide a pointer
to local memory that'll receive the result of an AIO, including details
about what kind of errors occurred (possible errors are e.g. a read failing
or a buffer's checksum validation failing).

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Yeah, a high-level README would be nice. Without that, it's hard to
follow what "handed out" and "defined" above means for example.

A few quick comments on the patches:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch

+1, this seems ready to be committed right away.

v2.0-0002-Allow-lwlocks-to-be-unowned.patch

With LOCK_DEBUG, LWLock->owner will point to the backend that acquired
the lock, but it doesn't own it anymore. That's reasonable, but maybe
add a boolean to the LWLock to mark whether the lock is currently owned
or not.

The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockRelease(). From the names, you might
think that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.

v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch

+1. The old comment "We don't currently need any ResourceOwner in a
walsender process" was a bit misleading, because the walsender did
create the short-lived "base backup" resource owner, so it's nice to get
that fixed.

v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch

My refactoring around postmaster.c child process handling will conflict
with this [1]. Not in any fundamental way, but can I ask you to review
those patches, please? After those patches, AIO workers should also have
PMChild slots (formerly known as Backend structs).

[1]: /messages/by-id/a102f15f-eac4-4ff2-af02-f9ff209ec66f@iki.fi

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: AIO v2.0

Hi,

On 2024-09-02 13:03:07 +0300, Heikki Linnakangas wrote:

On 01/09/2024 09:27, Andres Freund wrote:

In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).

Yeah, a high-level README would be nice. Without that, it's hard to follow
what "handed out" and "defined" above means for example.

Yea - I had actually written a bunch of that before, but then redesigns just
obsoleted most of it :(

FWIW, "handed out" is an IO handle acquired by code, which doesn't yet have an
operation associated with it. Once "defined", the handle's operation could be -
but hasn't yet been - executed.

A few quick comments the patches:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch

+1, this seems ready to be committed right away.

Cool

v2.0-0002-Allow-lwlocks-to-be-unowned.patch

With LOCK_DEBUG, LWLock->owner will point to the backend that acquired the
lock, but it doesn't own it anymore. That's reasonable, but maybe add a
boolean to the LWLock to mark whether the lock is currently owned or not.

Hm, not sure it's worth doing that...

The LWLockReleaseOwnership() name is a bit confusing together with
LWLockReleaseUnowned() and LWLockrelease(). From the names, you might think
that they all release the lock, but LWLockReleaseOwnership() just
disassociates it from the current process. Rename it to LWLockDisown()
perhaps.

Yea, that makes sense.

v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch

My refactoring around postmaster.c child process handling will conflict with
this [1]. Not in any fundamental way, but can I ask you to review those
patch, please? After those patches, AIO workers should also have PMChild
slots (formerly known as Backend structs).

I'll try to do that soonish!

Greetings,

Andres Freund

#4 陈宗志
baotiao@gmail.com
In reply to: Andres Freund (#3)
Re: AIO v2.0

I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
For example, which paths were modified in the AIO module? Is it the
path for writing WAL logs, or the path for flushing pages, etc.?

Also, I recommend keeping this patch as small as possible.
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.

#5 David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#1)
Re: AIO v2.0

On Sun, 1 Sept 2024 at 18:28, Andres Freund <andres@anarazel.de> wrote:

0 workers 1 worker 2 workers 4 workers
master: 65.753 33.246 21.095 12.918
aio v2.0, worker: 21.519 12.636 10.450 10.004
aio v2.0, uring*: 31.446 17.745 12.889 10.395
aio v2.0, uring** 23.497 13.824 10.881 10.589
aio v2.0, direct, worker: 22.377 11.989 09.915 09.772
aio v2.0, direct, uring*: 24.502 12.603 10.058 09.759

I took this for a test drive on an AMD 3990x machine with a 1TB
Samsung 980 Pro SSD on PCIe 4. I only tried io_method = io_uring, but
I did try with and without direct IO.

This machine has 64GB RAM and I was using ClickBench Q2 [1], which is
"SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;"
(for some reason they use 0-based query IDs). This table is 64GBs
without indexes.

I'm seeing direct IO slower than buffered IO with smaller worker
counts. That's counter to what I would have expected as I'd have
expected the memcpys from the kernel space to be quite an overhead in
the buffered IO case. With larger worker counts the bottleneck is
certainly disk. The part that surprised me was that the bottleneck is
reached more quickly with buffered IO. I was seeing iotop going up to
5.54GB/s at higher worker counts.

times in milliseconds
workers   buffered    direct    cmp
0            58880    102852    57%
1            33622     53538    63%
2            24573     40436    61%
4            18557     27359    68%
8            14844     17330    86%
16           12491     12754    98%
32           11802     11956    99%
64           11895     11941   100%

Is there some other information I can provide to help this make sense?
(Or maybe it does already to you.)

David

[1]: https://github.com/ClickHouse/ClickBench/blob/main/postgresql-tuned/queries.sql

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: AIO v2.0

Hi,

Attached is the next version of the patchset. Changes:

- added a "sync" io method; the main benefit is that the main AIO commit
doesn't need to include worker mode

- split worker and io_uring methods into their own commits

- added src/backend/storage/aio/README.md, explaining design constraints and
the resulting design on a high level

- renamed LWLockReleaseOwnership as suggested by Heikki

- a bunch of small cleanups and improvements

There's plenty more to do, but I thought this would be a useful checkpoint.

Greetings,

Andres Freund

Attachments:

v2.1-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch (+6/-2)
v2.1-0002-Allow-lwlocks-to-be-unowned.patch (+82/-31)
v2.1-0003-Use-aux-process-resource-owner-in-walsender.patch (+13/-41)
v2.1-0004-Ensure-a-resowner-exists-for-all-paths-that-may.patch (+15/-2)
v2.1-0005-bufmgr-smgr-Don-t-cross-segment-boundaries-in-S.patch (+55/-1)
v2.1-0006-aio-Basic-subsystem-initialization.patch (+192/-1)
v2.1-0007-aio-Core-AIO-implementation.patch (+2332/-2)
v2.1-0008-aio-Skeleton-IO-worker-infrastructure.patch (+311/-19)
v2.1-0009-aio-Add-worker-method.patch (+428/-10)
v2.1-0010-aio-Add-liburing-dependency.patch (+176/-1)
v2.1-0011-aio-Add-io_uring-method.patch (+398/-1)
v2.1-0012-aio-Add-README.md-explaining-higher-level-desig.patch (+311/-1)
v2.1-0013-aio-Implement-smgr-md.c-aio-methods.patch (+434/-3)
v2.1-0014-bufmgr-Implement-AIO-support.patch (+520/-8)
v2.1-0015-bufmgr-Use-aio-for-StartReadBuffers.patch (+182/-103)
v2.1-0016-aio-Very-WIP-read_stream.c-adjustments-for-real.patch (+27/-8)
v2.1-0017-aio-Add-IO-queue-helper.patch (+232/-1)
v2.1-0018-bufmgr-use-AIO-in-checkpointer-bgwriter.patch (+580/-65)
v2.1-0019-very-wip-test_aio-module.patch (+1290/-3)
v2.1-0020-Temporary-Increase-BAS_BULKREAD-size.patch (+5/-2)
#7 Andres Freund
andres@anarazel.de
In reply to: 陈宗志 (#4)
Re: AIO v2.0

Hi,

On 2024-09-05 01:37:34 +0800, 陈宗志 wrote:

I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.

Yep, that was already on my todo list. The version I just posted includes
that.

For example, which paths were modified in the AIO module?
Is it the path for writing WAL logs, or the path for flushing pages, etc.?

I don't think it's good to document this in a design document - that's just
bound to get out of date.

For now the patchset causes AIO to be used for

1) all users of read_stream.h, e.g. sequential scans

2) bgwriter / checkpointer, mainly to have a way to exercise the write path. As
mentioned in my email upthread, the code for that is in a somewhat rough
shape as Thomas Munro is working on a more general abstraction for some of
this.

The earlier patchset added a lot more AIO uses because I needed to know all
the design constraints. It e.g. added AIO use in WAL. While that allowed me to
learn a lot, it's not something that makes sense to continue working on for
now, as it requires a lot of work that's independent of AIO. Thus I am
focusing on the above users for now.

Also, I recommend keeping this patch as small as possible.

Yep. That's my goal (as mentioned upthread).

For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.

Currently the patchset doesn't contain libaio support and I am not planning to
work on using libaio. Nor do I think it makes sense for anybody else to do so
- libaio doesn't work for buffered IO, making it imo not particularly useful
for us.

The io_uring specific code isn't particularly complex / large compared to the
main AIO infrastructure.

Greetings,

Andres Freund

#8 Robert Pang
robertpang@google.com
In reply to: Andres Freund (#7)
Re: AIO v2.0

Hi Andres

Thanks for the AIO patch update. I gave it a try and ran into a PANIC
in bgwriter when executing a benchmark.

2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processes

I debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers the way BufferSync() does for checkpoints.
Here is a small patch to fix it.

Best regards
Robert

Attachments:

0001-Fix-BgBufferSync-to-limit-batch-size-under-io_bounce.patch (+4/-2)
#9 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#6)
Re: AIO v2.0

On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 are Ready for Committer.

+typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md. Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?

+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

+
+	/* IO finished, but result has not yet been processed */
+	AHS_REAPED,
+
+	/* IO completed, shared completion has been called */
+	AHS_COMPLETED_SHARED,
+
+	/* IO completed, local completion has been called */
+	AHS_COMPLETED_LOCAL,
+} PgAioHandleState;
+void
+pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
+{
+	PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, ioh_node);
+
+	Assert(ioh->resowner);
+
+	ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
+	ioh->resowner = NULL;
+
+	switch (ioh->state)
+	{
+		case AHS_IDLE:
+			elog(ERROR, "unexpected");
+			break;
+		case AHS_HANDED_OUT:
+			Assert(ioh == my_aio->handed_out_io || my_aio->handed_out_io == NULL);
+
+			if (ioh == my_aio->handed_out_io)
+			{
+				my_aio->handed_out_io = NULL;
+				if (!on_error)
+					elog(WARNING, "leaked AIO handle");
+			}
+
+			pgaio_io_reclaim(ioh);
+			break;
+		case AHS_DEFINED:
+		case AHS_PREPARED:
+			/* XXX: Should we warn about this when is_commit? */

Yes.

+			pgaio_submit_staged();
+			break;
+		case AHS_IN_FLIGHT:
+		case AHS_REAPED:
+		case AHS_COMPLETED_SHARED:
+			/* this is expected to happen */
+			break;
+		case AHS_COMPLETED_LOCAL:
+			/* XXX: unclear if this ought to be possible? */
+			pgaio_io_reclaim(ioh);
+			break;
+	}
+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup. Even so,
the next point might remove the need here:

+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.

+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi

I used the attached makefile patch to build w/ liburing.

+pgaio_uring_shmem_init(bool first_time)
+{
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - MAX_IO_WORKERS;
+	bool		found;
+
+	aio_uring_contexts = (PgAioUringContext *)
+		ShmemInitStruct("AioUring", pgaio_uring_shmem_size(), &found);
+
+	if (found)
+		return;
+
+	for (int contextno = 0; contextno < TotalProcs; contextno++)
+	{
+		PgAioUringContext *context = &aio_uring_contexts[contextno];
+		int			ret;
+
+		/*
+		 * XXX: Probably worth sharing the WQ between the different rings,
+		 * when supported by the kernel. Could also cause additional
+		 * contention, I guess?
+		 */
+#if 0
+		if (!AcquireExternalFD())
+			elog(ERROR, "No external FD available");
+#endif
+		ret = io_uring_queue_init(io_max_concurrency, &context->io_uring_ring, 0);

With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:

2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+	struct io_uring *uring_instance = &my_shared_uring_context->io_uring_ring;
+
+	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
+
+	for (int i = 0; i < num_staged_ios; i++)
+	{
+		PgAioHandle *ioh = staged_ios[i];
+		struct io_uring_sqe *sqe;
+
+		sqe = io_uring_get_sqe(uring_instance);
+
+		pgaio_io_prepare_submit(ioh);
+		pgaio_uring_sq_from_io(ioh, sqe);
+	}
+
+	while (true)
+	{
+		int			ret;
+
+		pgstat_report_wait_start(WAIT_EVENT_AIO_SUBMIT);
+		ret = io_uring_submit(uring_instance);
+		pgstat_report_wait_end();
+
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
+	int			returnCode;
+	Vfd		   *vfdP;
+
+	Assert(FileIsValid(file));
+
+	DO_DB(elog(LOG, "FileStartWriteV: %d (%s) " INT64_FORMAT " %d",
+			   file, VfdCache[file].fileName,
+			   (int64) offset,
+			   iovcnt));
+
+	returnCode = FileAccess(file);
+	if (returnCode < 0)
+		return returnCode;
+
+	vfdP = &VfdCache[file];
+
+	/* FIXME: think about / reimplement  temp_file_limit */
+
+	pgaio_io_prep_writev(ioh, vfdP->fd, iovcnt, offset);
+
+	return 0;
+}

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them. Otherwise,
deadlocks like this would happen:

backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons

If that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically. Could you add a mention of "deadlock" in the
comment at whichever code achieves that?

I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations. Is there any
specific decision you'd like to settle before patch 6 exits WIP?

Thanks,
nm

Attachments:

uring-makefile-v1.patch (+4/-3)
#10 Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#9)
Re: AIO v2.0

Hi,

Thanks for the review!

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:

On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 are Ready for Committer.

Cool!

+typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else. I recommend referring to it in README.md.

Makes sense.

Would you also cover the valid state transitions and which of them any
backend can do vs. which are specific to the defining backend?

Yea, we should. I earlier had something, but because details were still
changing it was hard to keep up2date.

+{
+	/* not in use */
+	AHS_IDLE = 0,
+
+	/* returned by pgaio_io_get() */
+	AHS_HANDED_OUT,
+
+	/* pgaio_io_start_*() has been called, but IO hasn't been submitted yet */
+	AHS_DEFINED,
+
+	/* subjects prepare() callback has been called */
+	AHS_PREPARED,
+
+	/* IO is being executed */
+	AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach. For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.

+void
+pgaio_io_ref_wait(PgAioHandleRef *ior)
+{
+	uint64		ref_generation;
+	PgAioHandleState state;
+	bool		am_owner;
+	PgAioHandle *ioh;
+
+	ioh = pgaio_io_from_ref(ior, &ref_generation);
+
+	am_owner = ioh->owner_procno == MyProcNumber;
+
+
+	if (pgaio_io_was_recycled(ioh, ref_generation, &state))
+		return;
+
+	if (am_owner)
+	{
+		if (state == AHS_DEFINED || state == AHS_PREPARED)
+		{
+			/* XXX: Arguably this should be prevented by callers? */
+			pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both. AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to clean up.

That, or not having submitted the IO. One thing I've been thinking about as
being potentially helpful infrastructure is to have something similar to a
critical section, except that it asserts that one is not allowed to block or
forget submitting staged IOs.

+void
+pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
+{
+	Assert(ioh->state == AHS_HANDED_OUT);
+	Assert(pgaio_io_has_subject(ioh));
+
+	ioh->op = op;
+	ioh->state = AHS_DEFINED;
+	ioh->result = 0;
+
+	/* allow a new IO to be staged */
+	my_aio->handed_out_io = NULL;
+
+	pgaio_io_prepare_subject(ioh);
+
+	ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here. This code already needs to be safe for that (per
README.md). When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common(). I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.

Makes sense.

+if test "$with_liburing" = yes; then
+  PKG_CHECK_MODULES(LIBURING, liburing)
+fi

I used the attached makefile patch to build w/ liburing.

Thanks, will incorporate.

With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
fails early:

Right - that's to be expected.

2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: starting PostgreSQL 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 64-bit
2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG: listening on Unix socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
2024-09-15 12:46:08.203 PDT startup[2069423] LOG: database system was shut down at 2024-09-15 12:46:07 PDT
2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL: the database system is starting up
2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG: database system is ready to accept connections
2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC: failed: -9/Bad file descriptor
2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC: failed: -95/Operation not supported
2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC: unexpected: -95/Operation not supported: No such file or directory
2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG: received fast shutdown request

I expect that's from io_uring_queue_init() stashing in shared memory a file
descriptor and mmap address, which aren't valid in EXEC_BACKEND children.
Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner - I don't think there's anything to be
gained by supporting io_uring in this situation. It's not like anybody will
use it for real-world workloads where performance matters. Nor would it be
useful for portability testing.

+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different from using uring to,
e.g., wait for incoming network data on thousands of sockets, where you could
have an essentially unbounded number of requests outstanding.

What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?

+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.

For non-sync IO methods, I gather it's essential that a process other than the
IO definer be scanning for incomplete IOs and completing them.

Yep - it's something I've been fighting with / redesigning a *lot*. Earlier
the AIO subsystem could transparently retry IOs, but that ends up being a
nightmare - or at least I couldn't find a way to not make it a
nightmare. There are two main complexities:

1) What if the IO is being completed in a critical section? We can't reopen
the file in that situation. My initial fix for this was to defer retries,
but that's problematic too:

2) Acquiring an IO needs to be able to guarantee forward progress. Because
there's a limited number of IOs that means we need to be able to complete
IOs while acquiring an IO. So we can't just keep the IO handle around -
which in turn means that we'd need to save the state for retrying
somewhere. Which would require some pre-allocated memory to save that
state.

Thus I think it's actually better if we delegate retries to the callsites. I
was thinking that for partial reads of shared buffers we ought to not set
BM_IO_ERROR though...

Otherwise, deadlocks like this would happen:

backend1 locks blk1 for non-IO reasons
backend2 locks blk2, starts AIO write
backend1 waits for lock on blk2 for non-IO reasons
backend2 waits for lock on blk1 for non-IO reasons

If that's right, in worker mode, the IO worker resolves that deadlock. What
resolves it under io_uring? Another process that happens to do
pgaio_io_ref_wait() would dislodge things, but I didn't locate the code to
make that happen systematically.

Yea, it's code that I haven't forward ported yet. I think basically
LockBuffer[ForCleanup] ought to call pgaio_io_ref_wait() when it can't
immediately acquire the lock and if the buffer has IO going on.
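The not-yet-forward-ported logic might look roughly like this (pseudocode; pgaio_io_ref_wait() and the buffer IO reference are from the patchset, everything else is guessed):

```
LockBuffer(buf, mode):
    if ConditionalLockBuffer(buf, mode) succeeds:
        return
    if buf has AIO in progress:
        pgaio_io_ref_wait(&buf->io_ref)   /* drives the IO to completion if
                                             nobody else will, breaking the
                                             lock-vs-in-flight-IO deadlock */
    acquire the content lock, blocking, as today
```

The key property is that a backend about to block on a buffer lock first ensures any IO on that buffer can make progress, which is what the IO worker provides for free in worker mode.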

I could share more-tactical observations about patches 6-20, but they're
probably things you'd change without those observations.

Agreed.

Is there any specific decision you'd like to settle before patch 6 exits
WIP?

Patch 6 specifically? That I really mainly kept separate for review - it
doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?

In case you mean 6+7 or 6 to ~11, I can think of the following:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

- The header split doesn't quite seem right yet

- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications

- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.

- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.

- I'm not sure that I like name "subject" for the different things AIO is
performed for

- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.

- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.

- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.

- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant
slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.

That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.

Thanks again,

Andres Freund

#11Andres Freund
andres@anarazel.de
In reply to: Robert Pang (#8)
Re: AIO v2.0

Hi,

On 2024-09-12 14:55:49 -0700, Robert Pang wrote:

Hi Andres

Thanks for the AIO patch update. I gave it a try and ran into a PANIC
in bgwriter when executing a benchmark.

2024-09-12 01:38:00.851 PDT [2780939] PANIC: no more bbs
2024-09-12 01:38:00.854 PDT [2780473] LOG: background writer process
(PID 2780939) was terminated by signal 6: Aborted
2024-09-12 01:38:00.854 PDT [2780473] LOG: terminating any other
active server processes

I debugged a bit and found that BgBufferSync() is not capping the
batch size under io_bounce_buffers like BufferSync() for checkpoint.
Here is a small patch to fix it.

Good catch, thanks!

I am hoping (as described in my email to Noah a few minutes ago) that we can
get away from needing bounce buffers. They are a quite expensive solution to a
problem we made for ourselves...

Greetings,

Andres Freund

#12Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#10)
Re: AIO v2.0

On Mon, Sep 16, 2024 at 01:51:42PM -0400, Andres Freund wrote:

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:

On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:

Reattaching descriptors and memory in each child may work, or one could just
block io_method=io_uring under EXEC_BACKEND.

I think the latter option is saner

Works for me.

+pgaio_uring_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
+{
+		if (ret == -EINTR)
+		{
+			elog(DEBUG3, "submit EINTR, nios: %d", num_staged_ios);
+			continue;
+		}

Since io_uring_submit() is a wrapper around io_uring_enter(), this should also
retry on EAGAIN. "man io_uring_enter" has:

EAGAIN The kernel was unable to allocate memory for the request, or
otherwise ran out of resources to handle it. The application should wait
for some completions and try again.

Hm. I'm not sure that makes sense. We only allow a limited number of IOs to be
in flight for each uring instance. That's different to a use of uring to
e.g. wait for incoming network data on thousands of sockets, where you could
have essentially unbounded amount of requests outstanding.

What would we wait for? What if we were holding a critical lock in that
moment? Would it be safe to just block for some completions? What if there's
actually no IO in progress?

I'd try the following. First, scan for all IOs of all processes at
AHS_DEFINED and later, advancing them to AHS_COMPLETED_SHARED. This might be
unsafe today, but discovering why it's unsafe likely will inform design beyond
EAGAIN returns. I don't specifically know of a way it's unsafe. Do just one
pass of that; there may be newer IOs in progress afterward. If submit still
gets EAGAIN, sleep a bit and retry. Like we do in pgwin32_open_handle(), fail
after a fixed number of iterations. This isn't great if we hold a critical
lock, but it beats the alternative of PANIC on the first EAGAIN.

+FileStartWriteV(struct PgAioHandle *ioh, File file,
+				int iovcnt, off_t offset,
+				uint32 wait_event_info)
+{
...

FileStartWriteV() gets to state AHS_PREPARED, so let's align with the state
name by calling it FilePrepareWriteV (or FileWriteVPrepare or whatever).

Hm - that doesn't necessarily seem right to me. I don't think the caller
should assume that the IO will just be prepared and not already completed by
the time FileStartWriteV() returns - we might actually do the IO
synchronously.

Yes. Even if it doesn't become synchronous IO, some other process may advance
the IO to AHS_COMPLETED_SHARED by the next wake-up of the process that defined
the IO. Still, I think this shouldn't use the term "Start" while no state
name uses that term. What else could remove that mismatch?

Is there any specific decision you'd like to settle before patch 6 exits
WIP?

Patch 6 specifically? That I really mainly kept separate for review - it

No. I'll rephrase as "Is there any specific decision you'd like to settle
before the next cohort of patches exits WIP?"

doesn't seem particularly interesting to commit it earlier than 7, or do you
think differently?

No, I agree a lone commit of 6 isn't a win. Roughly, the eight patches
6-9,12-15 could be a minimal attractive unit. I've not thought through that
grouping much.

In case you mean 6+7 or 6 to ~11, I can think of the following:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.

- The header split doesn't quite seem right yet

I won't have a strong opinion on that one. The aio.c/aio_io.c split did catch
my attention. I made a note to check it again once those files get header
comments.

- I'd like to implement retries in the later patches, to make sure that it
doesn't have design implications

Yes, that's a blocker to me.

- Worker mode needs to be able to automatically adjust the number of running
workers, I think - otherwise it's going to be too hard to tune.

Changing that later wouldn't affect much else, so I'd not consider it a
blocker. (The worst case is that we think the initial AIO release will be a
loss for most users, so we wrap it in debug_ terminology like we did for
debug_io_direct. I'm not saying worker scaling will push AIO from one side of
that line to another, but that's why I'm fine with commits that omit
self-contained optimizations.)

- I think the PgAioHandles need to be slimmed down a bit - there's some design
evolution visible that should not end up in the tree.

Okay.

- I'm not sure that I like name "subject" for the different things AIO is
performed for

How about one of these six terms:

- listener, observer [if you view smgr as an observer of IOs in the sense of https://en.wikipedia.org/wiki/Observer_pattern]
- class, subclass, type, tag [if you view an SmgrIO as a subclass of an IO, in the object-oriented sense]

- I am wondering if the need for pgaio_io_set_io_data_32() (to store the set
of buffer ids that are affected by one IO) could be replaced by repurposing
BufferDesc->freeNext or something along those lines. I don't like the amount
of memory required for storing those arrays, even if it's not that much
compared to needing space to store struct iovec[PG_IOV_MAX] for each AIO
handle.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker.

- I'd like to extend the test module to actually test more cases, it's too
hard to reach some paths, particularly without [a lot] of users yet. That's
not strictly a dependency of the earlier patches - since the initial patches
can't actually do much in the way of IO.

Agreed. Among the post-patch check-world coverage, which uncovered parts have
the most risk?

- We shouldn't reserve AioHandles etc for io workers - but because different
types of aux processes don't use a predetermined ProcNumber, that's not
entirely trivial without adding more complexity. I've actually wondered
whether IO workers should be their own "top-level" kind of process, rather
than an aux process. But that seems quite costly.

Here too, changing that later wouldn't affect much else, so I'd not consider
it a blocker. Of these ones I'm calling non-blockers, which would you most
regret deferring?

- Right now the io_uring mode has each backend's io_uring instance visible to
each other process. That ends up using a fair number of FDs. That's OK from
an efficiency perspective, but I think we'd need to add code to adjust the
soft RLIMIT_NOFILE (it's set to 1024 on most distros because there are
various programs that iterate over all possible FDs, causing significant

Agreed on raising the soft limit. Docs and/or errhint() likely will need to
mention system configuration nonetheless, since some users will encounter
RLIMIT_MEMLOCK or /proc/sys/kernel/io_uring_disabled.

slowdowns when the soft limit defaults to something high). I earlier had a
limited number of io_uring instances, but that added a fair amount of
overhead because then submitting IO would require a lock.

That again doesn't have to be solved as part of the earlier patches but
might have some minor design impact.

How far do you see the design impact spreading on that one?

Thanks,
nm

#13Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#12)
Re: AIO v2.0

Hi,

On 2024-09-17 11:08:19 -0700, Noah Misch wrote:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.

On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.

Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.

Of course this could be addressed by tuning, but it seems like something that
shouldn't need to be tuned by the majority of folks running postgres.

We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksums, I don't feel like it's a
particularly promising approach.

However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.

Does that seem reasonable?

Greetings,

Andres Freund

#14Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#13)
Re: AIO v2.0

On Mon, 30 Sept 2024 at 16:49, Andres Freund <andres@anarazel.de> wrote:

On 2024-09-17 11:08:19 -0700, Noah Misch wrote:

- I am worried about the need for bounce buffers for writes of checksummed
buffers. That quickly ends up being a significant chunk of memory,
particularly when using a small shared_buffers with a higher than default
number of connections. I'm currently hacking up a prototype that'd prevent us
from setting hint bits with just a share lock. I'm planning to start a
separate thread about that.

AioChooseBounceBuffers() limits usage to 256 blocks (2MB) per MaxBackends.
Doing better is nice, but I don't consider this a blocker. I recommend
dealing with the worry by reducing the limit initially (128 blocks?). Can
always raise it later.

On storage that has nontrivial latency, like just about all cloud storage,
even 256 will be too low. Particularly for checkpointer.

Assuming 1ms latency - which isn't the high end of cloud storage latency - 256
blocks in flight limits you to <= 256MByte/s, even on storage that can have a
lot more throughput. With 3ms, which isn't uncommon, it's 85MB/s.

FYI, I think you're off by a factor of 8, i.e. that would be 2GB/sec and
666MB/sec respectively, given a normal page size of 8kB and exactly
1ms/3ms full round-trip latency:

1 page/ms per request * 256 in flight = 256 pages/ms; 256 pages/ms *
8kB/page = 2MiB/ms ~= 2GiB/sec. For 3ms, divide by 3 -> ~666MiB/sec.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#15Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#13)
Re: AIO v2.0

On Mon, Sep 30, 2024 at 10:49:17AM -0400, Andres Freund wrote:

We also discussed the topic at /messages/by-id/20240925020022.c5.nmisch@google.com

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

Given the issues that modifying pages while in flight causes, not just with PG
level checksums, but also filesystem level checksums, I don't feel like it's a
particularly promising approach.

However, I think this doesn't have to mean that the BM_SETTING_HINTS stuff has
to be completed before we can move forward with AIO. If I split out the write
portion from the read portion a bit further, the main AIO changes and the
shared-buffer read user can be merged before there's a dependency on the hint
bit stuff being done.

Does that seem reasonable?

Yes.

#16Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#6)
Re: AIO v2.0

On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

Attached is the next version of the patchset. (..)

Hi Andres,

Thank you for your admirable persistence on this. Please do not take the
following as criticism, just as a set of questions regarding the patchset
v2.1 that I finally got a little time to play with:

0. Doesn't v2.1-0011-aio-Add-io_uring-method.patch ->
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check?
Otherwise we'll never know, in theory, that the SQ is full; perhaps at least
such a check should be made with an Assert()? (I understand that right now
we allow just up to io_uring_queue_init(io_max_concurrency), but what
happens if:
a. a previous io_uring_submit() failed for some reason and we don't have
free space in the SQ?
b. (hypothetically) someday someone tries to make PG multithreaded and the
code starts using just one big queue, still without checking
io_uring_get_sqe()?

1. In [0] you wrote that the io_uring mode consumes a high number of FDs
(dangerously close to RLIMIT_NOFILE). I can attest that there are many
customers who use extremely high max_connections (4k-5k, with outliers at
10k in the wild too) - so they won't even start - and I have one doubt about
the user-friendliness impact of this. I'm quite certain it's going to be the
same as with pgbouncer, where one is forced to tweak the OS
(systemd/pam/limits.conf/etc); but PG is better off here because it tries to
preallocate and then close() a lot of FDs, so that's safer at runtime. IMVHO
even if we consume e.g. > 30% of FDs just for io_uring,
max_files_per_process loses its spirit a little and PG is going to start to
lose efficiency too, due to frequent open()/close() calls as the fd cache is
too small. Tomas also complained about this some time ago in [1].

So maybe it would be good to introduce a couple of sanity checks too (even
after setting a higher limit):
- issue a FATAL when io_method = io_uring && max_connections is close to
getrlimit(RLIMIT_NOFILE)
- issue a warning when io_method = io_uring && we wouldn't have even 1k real
FDs free for handling relation FDs (detect something bad like:
getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)

2. In pgaio_uring_postmaster_child_init_local() there's
io_uring_queue_init(32, ...) - why 32? :) And there's also a separate
io_uring_queue_init(io_max_concurrency), which seems to be derived from
AioChooseMaxConcurrency() and can go up to 64?

3. I find having two GUCs with such similar names
(effective_io_concurrency, io_max_concurrency) confusing. From the io_uring
perspective it is clear what io_max_concurrency is all about, but I bet
having effective_io_concurrency in the mix as well is going to be a little
confusing for users (well, it is to me). Maybe that README.md could
elaborate a little on the relation between those two? Or do you plan to
remove io_max_concurrency and bind it to effective_io_concurrency in the
future? To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014,
while the earlier-mentioned AioChooseMaxConcurrency() goes up to just 64.

4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?

5. It appears that pg_stat_io.reads is not refreshed until the query
finishes. While running a query for minutes with this patchset, I got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0

Or is that how it is supposed to work? Also, pg_stat_io.read_time would be
somewhat vague with io_uring/worker, so maybe zero is good here(?).
Otherwise we would have to measure the time spent waiting alone, but that
would cost more instructions for calculating IO times...

6. After playing with some basic measurements - which went fine - I wanted
to test simple PostGIS, even with sequential scans, to look for
compatibility issues (AFAIR Thomas Munro at PGConfEU indicated it as a good
testing point). But before that I tried to see what TOAST performance alone
looks like with AIO+DIO (debug_io_direct=data). One issue I found is that
DIO seems to be unusable until somebody teaches TOAST to use read streams -
is that correct? Maybe I'm doing something wrong, but I haven't seen any
TOAST <-> read streams topic:

-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)

hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.

The difference is that cold caches with DIO see a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data via seqscan: up to 285MB/s
* but TOAST maxes out at 132MB/s when using io_uring+DIO

Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------

7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls being issued, to review whether the
code is making sensible requests? There's no strace for uring - do you stick
to DEBUG3, or is using some bpftrace / xfsslower the best way to go?

8. Not sure if this helps, but I've somehow managed to hit the impossible
situation you describe in pgaio_uring_submit() ("ret != num_staged_ios"),
though I had to push urings really hard into using futexes, and I could have
made some coding error for that to occur too [3]. (I hit the "/* FIXME: fix
ret != submitted ?! seems like bug?! */" code path pretty often with a
6.10.x kernel.)

9. Please let me know the current line of thinking about this patchset: is
it intended to be committed in v18? As a debug feature or a non-debug
feature? (That is, which of the IO methods should be scrutinized the most,
given it is going to be the new default - sync or worker?)

10. At this point, does it even make sense to give pwritev2() with
RWF_ATOMIC an experimental try? (That thing is already in the open, but XFS
is apparently only going to cover it with 6.12.x; I could try with some
-rcX.)

-J.

p.s. I hope I did not ask stupid questions nor miss anything.

[0]: /messages/by-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf - "Right now the io_uring mode has each backend's io_uring instance visible to each other process.(..)"
/messages/by-id/237y5rabqim2c2v37js53li6i34v2525y2baf32isyexecn4ic@bqmlx5mrnwuf
- "Right now the io_uring mode has each backend's io_uring instance visible
to
each other process.(..)"

[1]: /messages/by-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com - sentence after: "FWIW there's another bottleneck people may not realize (..)"

[2]: /messages/by-id/x3f32prdpgalmiieyialqtn53j5uvb2e4c47nvnjetkipq3zyk@xk7jy7fnua6w

[3]: /messages/by-id/CAKZiRmwrBjCbCJ433wV5zjvwt_OuY7BsVX12MBKiBu+eNZDm6g@mail.gmail.com

#17Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#16)
Re: AIO v2.0

Hi,

Sorry for losing track of your message for this long; I saw it just now
because I was working on posting a new version.

On 2024-11-18 13:19:58 +0100, Jakub Wartak wrote:

On Fri, Sep 6, 2024 at 9:38 PM Andres Freund <andres@anarazel.de> wrote:
Thank you for your admirable persistence on this. Please do not take this as
criticism, just a set of questions regarding patchset v2.1, which I finally
got a little time to play with:

0. Doesn't the v2.1-0011-aio-Add-io_uring-method.patch -> in
pgaio_uring_submit() -> io_uring_get_sqe() need a return value check ?

Yea, it shouldn't ever happen, but it's worth adding a check.
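For illustration, a self-contained toy of the pattern such a check follows (this is deliberately not liburing; the names and the ring layout are made up): like io_uring_get_sqe(), the get-sqe step returns NULL when the submission queue is full, so the caller must test the result before using it rather than silently dereferencing.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy stand-in for an io_uring submission queue.  SQ_ENTRIES and all
 * names here are illustrative only, not from the patchset or liburing.
 */
#define SQ_ENTRIES 8

struct toy_sqe { int opcode; };

struct toy_ring
{
	struct toy_sqe entries[SQ_ENTRIES];
	int			unsubmitted;
};

static struct toy_sqe *
toy_get_sqe(struct toy_ring *ring)
{
	if (ring->unsubmitted >= SQ_ENTRIES)
		return NULL;			/* SQ full - caller must handle this */
	return &ring->entries[ring->unsubmitted++];
}

static int
toy_stage_io(struct toy_ring *ring, int opcode)
{
	struct toy_sqe *sqe = toy_get_sqe(ring);

	/* the check item 0 asks for: don't use the sqe without verifying it */
	if (sqe == NULL)
		return -1;				/* in PG this might be an Assert() or elog(PANIC) */
	sqe->opcode = opcode;
	return 0;
}
```

With SQ_ENTRIES entries already staged, the next toy_stage_io() call fails cleanly instead of writing past the ring.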

Otherwise we'll never know that the SQ is full; perhaps such a check should
at least be made with an Assert()? (I understand that right now we allow
just up to io_uring_queue_init(io_max_concurrency), but what happens if:
if:
a. previous io_uring_submit() failed for some reason and we do not have
free space for SQ?

We'd have PANICed at that failure :)

b. (hypothetical) someday someone will try to make PG multithreaded and the
code starts using just one big queue, still without checking for
io_uring_get_sqe()?

That'd not make sense - you'd still want to use separate rings, to avoid
contention.

1. In [0] you wrote that there's this high amount of FDs consumed for
io_uring (dangerously close to RLIMIT_NOFILE). I can attest that there are
many customers who are using extremely high max_connections (4k-5k, but
there are outliers with 10k in the wild too) - so they won't even start -
and I have one doubt about the user-friendliness impact of this. I'm quite
certain it's going to be the same as with pgbouncer, where one is forced to
tweak the OS (systemd/pam/limits.conf/etc), but PG is better off here
because it tries to preallocate and then close() a lot of FDs, so that's
safer at runtime. IMVHO, even if we consume, say, > 30% of FDs just for
io_uring, max_files_per_process loses its spirit a little bit and PG is
going to start losing efficiency too, due to frequent open()/close() calls
as the fd cache becomes too small. (Tomas also complained about it some
time ago in [1].)

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

In most distros the soft ulimit is set to something like 1024, but the hard
limit is much higher. The reason for that is that some applications try to
close all fds between 0 and RLIMIT_NOFILE - which takes a long time if
RLIMIT_NOFILE is high. By setting only the soft limit to a low value any
application needing higher limits can just opt into using more FDs.

On several of my machines the hard limit is 1073741816.
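A minimal sketch of that opt-in (the helper name is hypothetical, not from the patchset): raise the soft RLIMIT_NOFILE toward the hard limit, never beyond it, which does not require privileges.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_NOFILE toward the hard limit, capped at a
 * configured value.  Hypothetical helper for illustration only.
 */
static int
raise_nofile_limit(rlim_t want)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (want > rl.rlim_max)
		want = rl.rlim_max;		/* never exceed the hard limit */
	if (want <= rl.rlim_cur)
		return 0;				/* soft limit already high enough */
	rl.rlim_cur = want;
	/* raising the soft limit up to the hard limit needs no privileges */
	return setrlimit(RLIMIT_NOFILE, &rl);
}
```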

So maybe it would be good to introduce a couple of sanity checks too (even
after setting a higher limit):
- issue a FATAL in case of using io_method = io_uring && max_connections
being close to getrlimit(RLIMIT_NOFILE)
- issue a warning in case of using io_method = io_uring && not having even
a real 1k FDs free for handling relation FDs (detect something bad like:
getrlimit(RLIMIT_NOFILE) <= max_connections + max_files_per_process)

Probably still worth adding something like this, even if we were to do what I
am suggesting above.

2. In pgaio_uring_postmaster_child_init_local() there
"io_uring_queue_init(32,...)" - why 32? :) And also there's separate
io_uring_queue_init(io_max_concurrency) which seems to be derived from
AioChooseMaxConccurrency() which can go up to 64?

Yea, that's probably not right.

3. I find having two GUCs named so similarly (effective_io_concurrency,
io_max_concurrency) confusing. It is clear from the io_uring perspective
what io_max_concurrency is all about, but I bet also having
effective_io_concurrency in the mix is going to be a little confusing for
users (well, it is to me). Maybe the README.md could elaborate a little bit
on the relation between those two? Or do you plan to remove
io_max_concurrency and bind it to effective_io_concurrency in the future?

io_max_concurrency is a hard maximum that needs to be set at server start,
because it requires allocating shared memory. Whereas effective_io_concurrency
can be changed on a per-session and per-tablespace
basis. I.e. io_max_concurrency is a hard upper limit for an entire backend,
whereas effective_io_concurrency controls how much one scan (or whatever does
prefetching) can issue.
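To make the distinction concrete, a hedged illustration of how the two knobs might be set (the values are made up; io_max_concurrency is the patchset's restart-only GUC, effective_io_concurrency the long-existing one):

```ini
# postgresql.conf - restart required: sizes the shared-memory AIO handles,
# a hard per-backend upper limit
io_max_concurrency = 64

# settable per session or per tablespace - how much one scan (or whatever
# does prefetching) may keep in flight, always bounded by the above
effective_io_concurrency = 16
```

The second one can also be changed at runtime, e.g. with `SET effective_io_concurrency = 16;` or `ALTER TABLESPACE ... SET (effective_io_concurrency = ...);`.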

To add more fun, there's MAX_IO_CONCURRENCY nearby in v2.1-0014 too, while
the earlier-mentioned AioChooseMaxConccurrency() goes up to just 64

Yea, that should probably be disambiguated.

4. While we are at this, shouldn't the patch rename debug_io_direct to
simply io_direct so that GUCs are consistent in terms of naming?

I used to have a patch like that in the series and it was a pain to
rebase...

I'm also not sure this would be quite enough to make debug_io_direct
production ready, even if just considering io_direct=data. Without streaming
read use in heap + index VACUUM, RelationCopyStorage() and a few other places,
the performance consequences of using direct IO can be, um, surprising.

5. It appears that pg_stat_io.reads is not refreshed until the query is
finished. While running a query for minutes with this patchset, I've got:
now | reads | read_time
-------------------------------+----------+-----------
2024-11-15 12:09:09.151631+00 | 15004271 | 0
[..]
2024-11-15 12:10:25.241175+00 | 15004271 | 0
2024-11-15 12:10:26.241179+00 | 15004271 | 0
2024-11-15 12:10:27.241139+00 | 18250913 | 0

Or is that how it is supposed to work?

Currently the patch has a FIXME to add some IO statistics (I think I raised
that somewhere in this thread, too). It's not clear to me what IO time ought
to mean. I suspect the least bad answer is what you suggest:

Also pg_stat_io.read_time would be something vague with io_uring/worker, so
maybe zero is good here (?). Otherwise we would have to measure time spent
on waiting alone, but that would force more instructions for calculating io
times...

I.e. we should track the amount of time spent waiting for IOs.

I don't think tracking time in worker or such would make much sense, that'd
often end up with reporting more IO time than a query took.

6. After playing with some basic measurements - which went fine - I wanted
to test simple PostGIS, even with sequential scans, to see any
compatibility issues (AFAIR Thomas Munro indicated it as a good testing
point at PGConfEU), but before that I tried to see what the TOAST
performance alone is with AIO+DIO (debug_io_direct=data).

It's worth noting that with the last posted version you needed to increase
effective_io_concurrency to something very high to see sensible
performance.

That's due to the way read_stream_begin_impl() limited the number of buffers
pinned to effective_io_concurrency * 4 - which, due to io_combine_limit, ends
up allowing only a single IO in flight in case of sequential blocks until
effective_io_concurrency is set to 8 or such. I've adjusted that to some
degree now, but I think that might need a bit more sophistication.
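The arithmetic behind that can be sketched as follows; the helper name is made up, and io_combine_limit = 16 blocks (128kB at 8kB BLCKSZ) is an assumed value:

```c
#include <assert.h>

/*
 * Sketch of the pin-limit arithmetic described above: with the number of
 * pinned buffers capped at effective_io_concurrency * 4, and sequential
 * blocks merged into IOs of up to io_combine_limit blocks, the number of
 * full-sized combined IOs that can be in flight stays at ~1 until
 * effective_io_concurrency reaches 8 or so.  Constants are illustrative.
 */
static int
max_full_ios_in_flight(int effective_io_concurrency, int io_combine_limit)
{
	int			pin_limit = effective_io_concurrency * 4;
	int			ios = pin_limit / io_combine_limit;

	return ios > 1 ? ios : 1;	/* at least one IO is always allowed */
}
```

With io_combine_limit = 16, effective_io_concurrency values of 1 and 4 both allow only a single full IO in flight; only at 8 (32 pinned buffers) do two combined IOs fit.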

One issue I have found is that DIO seems to be unusable until somebody will
teach TOAST to use readstreams, is that correct? Maybe I'm doing something
wrong, but I haven't seen any TOAST <-> readstreams topic:

Hm, I suspect that a read stream won't help a whole lot in many toast
cases. Unless you have particularly long toast datums, the time is going to be
dominated by the random accesses, as each toast datum is looked up in a
non-predictable way.

Generally, using DIO requires tuning shared buffers much more aggressively
than not using DIO; no amount of stream use will change that. Of course we
should try to reduce that "downside"...

I'm not sure if the best way to do prefetching toast chunks would be to rely
on more generalized index->table prefetching support, or to have dedicated code.

-- 12MB table , 25GB toast
create table t (id bigint, t text storage external);
insert into t select i::bigint as id, repeat(md5(i::text),4000)::text as r
from generate_series(1,200000) s(i);
set max_parallel_workers_per_gather=0;
\timing
-- with cold caches: empty s_b, echo 3 > drop_caches
select sum(length(t)) from t;
master 101897.823 ms (01:41.898)
AIO 99758.399 ms (01:39.758)
AIO+DIO 191479.079 ms (03:11.479)

hotpath was detoast_attr() -> toast_fetch_datum() ->
heap_fetch_toast_slice() -> systable_getnext_ordered() ->
index_getnext_slot() -> index_fetch_heap() -> heapam_index_fetch_tuple() ->
ReadBufferExtended -> AIO code.

The difference is that on cold caches DIO gets a 2x slowdown; with a clean
s_b and so on:
* getting normal heap data via seqscan: up to 285MB/s
* but TOAST maxes out at 132MB/s when using io_uring+DIO

I started loading the data to try this out myself :).

Not about patch itself, but questions about related stack functionality:
----------------------------------------------------------------------------------------------------

7. Is pg_stat_aios still on the table or not? (AIO 2021 had it.) Any hints
on how to inspect the real I/O calls being requested, to review whether the
code is issuing sensible calls? There's no strace for io_uring - do you
stick to DEBUG3, or is using some bpftrace / xfsslower the best way to go?

I think we still want something like it, but I don't think it needs to be in
the initial commits.

There are kernel events that you can track using e.g. perf. Particularly
useful are
io_uring:io_uring_submit_req
io_uring:io_uring_complete

8. Not sure if that helps, but I've somehow managed to hit the impossible
situation you describe in pgaio_uring_submit() "(ret !=
num_staged_ios)", though I had to push urings really hard into using
futexes, and I may have made some coding error for that to occur [3]. As it
stands in the patch from my thread, that path was not covered: /* FIXME:
fix ret != submitted ?! seems like bug?! */ (I hit that code path pretty
often with a 6.10.x kernel.)

I think you can hit that if you don't take care to limit the number of IOs
being submitted at once or if you're not consuming completions. If the
completion queue is full enough the kernel at some point won't allow more IOs
to be submitted.

9. Please let me know what the current up-to-date line of thinking about
this patchset is: is it intended to be committed in v18?

I'd love to get some of it into 18. I don't quite know whether we can make it
happen and to what extent.

As a debug feature or as a non-debug feature? (That is, which of the IO
methods should be scrutinized the most, as it is going to be the new
default - sync or worker?)

I'd say initially worker, with a beta 1 or 2 checklist item to revise it.

10. At this point, does it even make sense to give pwritev2() with
RWF_ATOMIC an experimental try? (That support is already in the open, and
XFS is apparently going to cover it in 6.12.x, but I could try with some -rcX.)

I don't think that's worth doing right now. There's too many dependencies and
it's going to be a while till the kernel support for that is widespread enough
to matter.

There's also the issue that, to my knowledge, outside of cloud environments
there's pretty much no hardware that actually reports power-fail atomicity
sizes bigger than a sector.

p.s. I hope I did not ask stupid questions or miss anything.

You did not!

Greetings,

Andres Freund

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#17)
Re: AIO v2.0

Andres Freund <andres@anarazel.de> writes:

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.

regards, tom lane

#19Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
Re: AIO v2.0

Hi,

On 2024-12-19 17:34:29 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

My current thoughts around this are that we should generally, independent of
io_uring, increase the FD limit ourselves.

I'm seriously down on that, because it amounts to an assumption that
we own the machine and can appropriate all its resources. If ENFILE
weren't a thing, it'd be all right, but that is a thing. We have no
business trying to consume resources the DBA didn't tell us we could.

Arguably the configuration *did* tell us, by having a higher hard limit...

I'm not saying that we should increase the limit without a bound or without a
configuration option, btw.

As I had mentioned, the problem with relying on increasing the soft limit is
that it's not generally sensible to do so, because it causes a bunch of
binaries to be weirdly slow.

Another reason to not increase the soft rlimit is that doing so can break
programs relying on select().
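The select() hazard comes from FD_SETSIZE (1024 with glibc): a descriptor at or above it cannot be placed in an fd_set, so a globally raised soft limit can hand select()-based programs fds they cannot watch. A minimal check, with a made-up helper name:

```c
#include <assert.h>
#include <sys/select.h>

/*
 * select() can only watch descriptors below FD_SETSIZE; calling FD_SET()
 * with a larger fd writes out of bounds.  This is why opting into a higher
 * rlimit per application is safer than raising the soft limit globally.
 */
static int
fd_usable_with_select(int fd)
{
	return fd >= 0 && fd < FD_SETSIZE;
}
```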

But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?

Greetings,

Andres Freund

#20Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andres Freund (#19)
Re: AIO v2.0

On Fri, 20 Dec 2024 at 01:54, Andres Freund <andres@anarazel.de> wrote:

Arguably the configuration *did* tell us, by having a higher hard limit...
<snip>
But opting into a higher rlimit, while obviously adhering to the hard limit
and perhaps some other config knob, seems fine?

Yes, totally fine. That's exactly the reasoning why the hard limit is
so much larger than the soft limit by default on systems with systemd:

https://0pointer.net/blog/file-descriptor-limits.html

#21Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema-Nio (#20)
#22Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#23Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
#24Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#22)
#25Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#22)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#22)
#28Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#26)
#29Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#27)
#30Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#25)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#28)
#32Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#28)
#33Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#23)
#34Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#33)
#35Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#32)
#36Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#31)
#37Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#34)
#38Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#37)
#39Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#38)
#40Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#39)
#41Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#36)
#43Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#42)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#43)
#45Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#46Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#45)
#47Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#45)
#48Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#47)
#49Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#46)
#50James Hunter
james.hunter.pg@gmail.com
In reply to: Thomas Munro (#47)
#51Robert Haas
robertmhaas@gmail.com
In reply to: James Hunter (#50)
#52Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#47)
#53Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#52)
#54James Hunter
james.hunter.pg@gmail.com
In reply to: Andres Freund (#52)
#55Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#49)
#56Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#53)
#57Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#56)
#58Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#59Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#27)
#60Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#58)
#61Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#58)
#62Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#61)
#63Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#62)
#64Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#61)
#65Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#64)
#66Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#63)
#67Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#66)
#68Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#65)
#69Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#68)
#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#69)
#71Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#61)
#72Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#71)
#73Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#72)
#74Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#73)
#75Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#71)
#76Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#75)
#77Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#76)
#78Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#76)
#79Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#77)
#80Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#79)
#81Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#80)
#82Andres Freund
andres@anarazel.de
In reply to: Antonin Houska (#81)
#83Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#82)
#84Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#82)
#85Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#82)
#86Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#84)
#87Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#86)
#88Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#87)
#89Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#86)
#90Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#86)
#91Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#90)
#92Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#86)
#93Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#92)
#94Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#93)
#95Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#86)
#96Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#93)
#97Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#96)
#98Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#94)
#99Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#98)
#100Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#86)
#101Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#102Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#103Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#101)
#104Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#103)
#105Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#102)
#106Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#105)
#107Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#104)
#108Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#109Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#107)
#110Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#108)
#111Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#110)
#112Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#111)
#113Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#112)
#114Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#112)
#115Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#100)
#116Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#106)
#117Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#100)
#118Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#114)
#119Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#117)
#120Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#115)
#121Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#118)
#122Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#119)
#123Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#121)
#124Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#120)
#125Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#122)
#126Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#123)
#127Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#125)
#128Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#126)
#129Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#128)
#130Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#129)
#131Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#130)
#132Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#131)
#133Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#132)
#134Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#127)
#135Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#115)
#136Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#134)
#137Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#115)
#138Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#135)
#139Thom Brown
thom@linux.com
In reply to: Andres Freund (#115)
#140Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#138)
#141Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#137)
#142Andres Freund
andres@anarazel.de
In reply to: Thom Brown (#139)
#143Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#141)
#144Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#115)
#145Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#144)
#146Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#145)
#147Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#146)
#148Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#147)
#149Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#147)
#150Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#149)
#151Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#150)
#152Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#151)
#153Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#152)
#154Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#152)
#155Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#154)
#156Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#155)
#157Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#125)
#158Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#143)
#159Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#154)
#160Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#159)
#161Aleksander Alekseev
aleksander@timescale.com
In reply to: Andres Freund (#160)
#162Andres Freund
andres@anarazel.de
In reply to: Aleksander Alekseev (#161)
#163Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#160)
#164Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#163)
#165Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#164)
#166Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#165)
#167Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#164)
#168Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#167)
#169Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#168)
#170Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#169)
#171Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#170)
#172Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#167)
#173Andres Freund
andres@anarazel.de
In reply to: Ranier Vilela (#171)
#174Ranier Vilela
ranier.vf@gmail.com
In reply to: Andres Freund (#173)
#175Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#172)
#176Andres Freund
andres@anarazel.de
In reply to: Ranier Vilela (#174)
#177Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#175)
#178Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#177)
#179Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#178)
#180Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#179)
#181Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#170)
#182Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#181)
#183Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#182)
#184Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#180)
#185Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#183)
#186Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#185)
#187Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#186)
#188Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#159)
#189Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#86)
#190Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#188)
#191Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#190)
#192Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#187)
#193Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#187)
#194Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#193)
#195Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#194)
#196Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#189)
#197Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#195)
#198Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#197)
#199Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#82)
#200Andres Freund
andres@anarazel.de
In reply to: Antonin Houska (#199)
#201Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#200)
#202Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#201)
#203Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#202)
#204Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#203)
#205Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#196)
#206Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#205)
#207Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#205)
#208Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#206)
#209Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#208)
#210Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#207)
#211Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#203)
#212Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#210)
#213Nathan Bossart
nathandbossart@gmail.com
In reply to: Tomas Vondra (#212)