Direct I/O
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
O_DIRECT asks the kernel to avoid caching file data as much as
possible. Here's a fun quote about it[2]:
"The exact semantics of Direct I/O (O_DIRECT) are not well specified.
It is not a part of POSIX, or SUS, or any other formal standards
specification. The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour has
generally been passed down as oral lore rather than as a formal set of
requirements and specifications."
It gives the kernel the opportunity to move data directly between
PostgreSQL's user space buffers and the storage hardware using DMA
hardware, that is, without CPU involvement or copying. Not all
storage stacks can do that, for various reasons, but even if not, the
caching policy should ideally still use temporary buffers and avoid
polluting the page cache.
These settings currently destroy performance, and are not intended to
be used by end-users, yet! That's why we filed them under
DEVELOPER_OPTIONS. You don't get automatic read-ahead, concurrency,
clustering or (of course) buffering from the kernel. The idea is that
later parts of the AIO patch-set will introduce mechanisms to replace
what the kernel is doing for us today, and then more, since we ought
to be even better at predicting our own future I/O than it, so that
we'll finish up ahead. Even with all that, you wouldn't want to turn
it on by default because the default shared_buffers would be
insufficient for any real system, and there are portability problems.
Examples of slowness:
* every 8KB sequential read or write becomes a full round trip to the
storage, one at a time
* data that is written to WAL and then read back in by WAL sender will
incur a full I/O round trip (that's probably not really an AIO problem,
that's something we should probably address by using shared memory
instead of files, as noted as a TODO item in the source code)
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
Assorted portability notes:
I expect this to "work" (that is, successfully destroy performance) on
typical developer systems running at least Linux, macOS, Windows and
FreeBSD. By work, I mean: not be rejected by PostgreSQL, not be
rejected by the kernel, and influence kernel cache behaviour on common
filesystems. It might be rejected with ENOTSUP, EINVAL etc. on some
more exotic filesystems and OSes. Of currently supported OSes, only
OpenBSD and Solaris don't have O_DIRECT at all, and we'll reject the
GUCs. For macOS and Windows we internally translate our own
PG_O_DIRECT flag to the correct flags/calls (committed a while
back[3]).
On Windows, scatter/gather is available only with direct I/O, so a
true pwritev would in theory be possible, but that has some more
complications and is left for later patches (probably using native
interfaces, not disguising as POSIX).
There may be systems on which 8KB offset alignment will not work at
all or not work well, and that's expected. For example, BTRFS, ZFS,
JFS "big file", UFS etc allow larger-than-8KB blocks/records, and an
8KB write will have to trigger a read-before-write. Note that
offset/length alignment requirements (blocks) are independent of
buffer alignment requirements (memory pages, 4KB).
The behaviour and cache coherency of files that have open descriptors
using both direct and non-direct flags may be complicated and vary
between systems. The patch currently lets you change the GUCs at
runtime so backends can disagree: that should probably not be allowed,
but is like that now for experimentation. More study is required.
If someone has a compiler that we don't know how to do
pg_attribute_aligned() for, then we can't make correctly aligned stack
buffers, so in that case direct I/O is disabled, but I don't know of
such a system (maybe aCC, but we dropped it). That's why smgr code
can only assert that pointers are IO-aligned if PG_O_DIRECT != 0, and
why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned()
macro, disabling the GUCs.
This seems to be an independent enough piece to get into the tree on
its own, with the proviso that it's not actually useful yet other than
for experimentation. Thoughts?
These patches have been hacked on at various times by Andres Freund,
David Rowley and me.
[1]: https://wiki.postgresql.org/wiki/AIO
[2]: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3]: /messages/by-id/CA+hUKG+ADiyyHe0cun2wfT+SVnFVqNYPxoO6J9zcZkVO7+NGig@mail.gmail.com
Attachments:
0001-Align-PGAlignedBlock-to-expected-page-size.patch
0002-XXX-palloc_io_aligned-not-for-review-here.patch
0003-Add-direct-I-O-settings-developer-only.patch
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
--
Justin
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)
Interesting idea. So "direct_io = data, wal, wal_init", or maybe that
should be spelled io_direct. ("Direct I/O" is a common term of art,
but we also have some more io_XXX GUCs in later patches, so it's hard
to choose...)
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
Hi,
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.
In the past more complicated GUCs have not been well received, but it does
seem like a nice way to reduce the amount of redundant stuff.
Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?
(By "better", I mean reducing the number of new GUCs - which is less
important for developer GUCs anyway.)
FWIW, if / once we get to actual AIO, at least some of these would stop being
developer-only GUCs. There's substantial performance benefits in using DIO
with AIO. Buffered IO requires the CPU to copy the data from the userspace
into the kernelspace. But DIO can use DMA for that, freeing the CPU to do more
useful work. Buffered IO tops out much much earlier than AIO + DIO, and
unfortunately tops out at much lower speeds on server CPUs.
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Greetings,
Andres Freund
Hi,
On 2022-11-01 15:54:02 -0700, Andres Freund wrote:
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Although it might actually be worth aligning fully everywhere - there's a
noticeable performance difference for buffered read IO.
I benchmarked this on my workstation and laptop.
I mmap'ed a buffer with 2 MiB alignment, MAP_ANONYMOUS | MAP_HUGETLB, and then
measured performance of reading 8192 bytes into the buffer at different
offsets. Each time I copied 16GiB in total. Within a program invocation I
benchmarked each offset 4 times, threw away the worst measurement, and
averaged the rest. Then used the best of three program invocations.
workstation with dual xeon Gold 5215:
turbo on turbo off
offset GiB/s GiB/s
0 18.358 13.528
8 15.361 11.472
9 15.330 11.418
32 17.583 13.097
512 17.707 13.229
513 15.890 11.852
4096 18.176 13.568
8192 18.088 13.566
2MiB 18.658 13.496
laptop with i9-9880H:
turbo on turbo off
offset GiB/s GiB/s
0 33.589 17.160
8 28.045 14.301
9 27.582 14.318
32 31.797 16.711
512 32.215 16.810
513 28.864 14.932
4096 32.503 17.266
8192 32.871 17.277
2MiB 32.657 17.262
Seems pretty clear that using 4096 byte alignment is worth it.
Greetings,
Andres Freund
On 11/1/22 2:36 AM, Thomas Munro wrote:
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
This is exciting to see! There are two other items to add to the TODO list
before this would be ready for production:
1) work_mem. This is a significant impediment to scaling shared buffers
the way you'd want to.
2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large
systems with 75% of memory going to shared_buffers, that becomes a very
significant problem, especially when the backend running the clock sweep
is doing so in order to perform an operation like a b-tree page split. I
suspect it shouldn't be too hard to deal with this issue by just having
bgwriter or another bgworker proactively ensuring some reasonable number
of buffers with usage_count=0 exist.
One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.
Hi,
On 2022-11-04 14:47:31 -0500, Jim Nasby wrote:
On 11/1/22 2:36 AM, Thomas Munro wrote:
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
This is exciting to see! There are two other items to add to the TODO list
before this would be ready for production:
1) work_mem. This is a significant impediment to scaling shared buffers the
way you'd want to.
I don't really think that's closely enough related to tackle together. Yes,
it'd be easier to set a large s_b if we had better work_mem management, but
it's a completely distinct problem, and in a lot of cases you could use DIO
without tackling the work_mem issue.
2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large systems
with 75% of memory going to shared_buffers, that becomes a very significant
problem, especially when the backend running the clock sweep is doing so in
order to perform an operation like a b-tree page split. I suspect it
shouldn't be too hard to deal with this issue by just having bgwriter or
another bgworker proactively ensuring some reasonable number of buffers with
usage_count=0 exist.
I agree this isn't great, but I don't think the replacement efficiency is that
big a problem. Replacing the wrong buffers is a bigger issue.
I've run tests with s_b=768GB (IIRC) without it showing up as a major
issue. If you have an extreme replacement rate at such a large s_b you have a
lot of other problems.
I don't want to discourage anybody from tackling the clock replacement issues,
the contrary, but AIO+DIO can show significant wins without those
changes. It's already a humongous project...
One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.
Which would be a very bad idea for now.... Thomas does have a patch for moving
them into the main buffer pool.
Greetings,
Andres Freund
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2022-11-10 14:26:20 +0700, John Naylor wrote:
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
With one exception, there's only a small number of places that allocate pages
dynamically and we only do it for a small number of buffers. So I don't think
we should worry too much about this for now.
The one exception to this: GetLocalBufferStorage(). But it already batches
memory allocations by increasing sizes, so I think we're good as well.
Greetings,
Andres Freund
On Wed, Nov 2, 2022 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.
Done as io_direct=data,wal,wal_init. Thanks Justin, this is better.
I resisted the urge to invent a meaning for 'on' and 'off', mainly
because it's not clear what values 'on' should enable and it'd be
strange to have off without on, so for now an empty string means off.
I suppose the meaning of this string could evolve over time: the names
of forks, etc.
Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?
Makes sense. The only tricky question was where to store the GUC. I
went for fd.c for now, but it doesn't seem quite right...
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Done. The tests take 2-3x as long depending on the OS.
Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.
Done.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Done. I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock.
You have to use the latter for SMgr, because I added alignment
assertions there. We might as well use it for any other I/O such as
frontend code too for a chance of a small performance boost as you
showed. For now I have not used PGIOAlignedBlock for BufFile, even
though it would be a great candidate for a potential speedup, only
because I am afraid of adding padding to every BufFile in scenarios
where we allocate many (could be avoided, a subject for separate
research).
V2 comprises:
0001 -- David's palloc_aligned() patch
https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
Attachments:
v2-0001-Add-allocator-support-for-larger-allocation-align.patch
v2-0002-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch
v2-0003-Add-io_direct-setting-developer-only.patch
v2-0004-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
David pushed the first as commit 439f6175, so here is a rebase of the
rest. I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O. In one place I had #ifdef
PG_O_DIRECT, but that's always defined, it's just that it's 0 on
Solaris and OpenBSD, and the check to reject the GUC wasn't quite
right on such systems.
Attachments:
v3-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch
v3-0002-Add-io_direct-setting-developer-only.patch
v3-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
On Thu, Dec 22, 2022 at 7:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
David pushed the first as commit 439f6175, so here is a rebase of the
rest. I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O. In one place I had #ifdef
PG_O_DIRECT, but that's always defined, it's just that it's 0 on
Solaris and OpenBSD, and the check to reject the GUC wasn't quite
right on such systems.
Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:
1. I think we don't need to overwrite the io_direct_string in
check_io_direct so that show_io_direct can be avoided.
2. check_io_direct can leak the flags memory - when io_direct is not
supported or for an invalid list syntax or an invalid option is
specified.
I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.
Some comments on the tests added:
1. Is there a way to know if Direct IO for WAL and data has been
picked up programmatically? IOW, can we know if the OS page cache is
bypassed? I know an external extension pgfincore which can help here,
but nothing in the core exists AFAICS.
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'),
"read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'),
"read back from local");
+$node->stop('immediate');
2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
3. A nitpick: Can we split these queries multi-line instead of in a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');
4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Review-comments-io_direct-GUC.txt
On Wed, Jan 25, 2023 at 8:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:
1. I think we don't need to overwrite the io_direct_string in
check_io_direct so that show_io_direct can be avoided.
Thanks for looking at this, and sorry for the late response. Yeah, agreed.
2. check_io_direct can leak the flags memory - when io_direct is not
supported or for an invalid list syntax or an invalid option is
specified.
I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.
Thanks. Your way is nicer. I merged your patch and added you as a co-author.
Some comments on the tests added:
1. Is there a way to know if Direct IO for WAL and data has been
picked up programmatically? IOW, can we know if the OS page cache is
bypassed? I know an external extension pgfincore which can help here,
but nothing in the core exists AFAICS.
Right, that extension can tell you how many pages are in the kernel
page cache which is quite interesting for this. I also once hacked up
something primitive to see *which* pages are in kernel cache, so I
could join that against pg_buffercache to measure double buffering,
when I was studying the "smile" shape where pgbench TPS goes down and
then back up again as you increase shared_buffers if the working set
is nearly as big as physical memory (code available in a link from
[1]).
Yeah, I agree it might be nice for human investigators to put
something like that in contrib/pg_buffercache, but I'm not sure you
could rely on it enough for an automated test, though, 'cause it
probably won't work on some file systems and the tests would probably
fail for random transient reasons (for example: some systems won't
kick pages out of kernel cache if they were already there, just
because we decided to open the file with O_DIRECT). (I got curious
about why mincore() wasn't standardised along with mmap() and all that
jazz; it seems the BSD and later Sun people who invented all those
interfaces didn't think that one was quite good enough[2], but every
(?) Unixoid OS copied it anyway, with variations... Apparently the
Windows thing is called VirtualQuery()).
2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
OK, sure, done. And also oops, that was completely wrong and not
working the way I had it in that version...
3. A nitpick: Can we split these queries multi-line instead of in a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');
OK.
4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();
Sure, but why not?
Otherwise, I rebased, and made a couple more changes:
I found a line of the manual about wal_sync_method that needed to be removed:
- The <literal>open_</literal>* options also use
<literal>O_DIRECT</literal> if available.
In fact that sentence didn't correctly document the behaviour in
released branches (wal_level=minimal is also required for that, so
probably very few people ever used it). I think we should adjust that
misleading sentence in back-branches, separately from this patch set.
I also updated the commit message to highlight the only expected
user-visible change for this, namely the loss of the above incorrectly
documented obscure special case, which is replaced by the less obscure
new setting io_direct=wal, if someone still wants that behaviour.
Also a few minor comment changes.
[1]: https://twitter.com/MengTangmu/status/994770040745615361
[2]: http://kos.enix.org/pub/gingell8.pdf
Attachments:
v4-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch
v4-0002-Add-io_direct-setting-developer-only.patch
v4-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
I did some testing with non-default block sizes, and found a few minor
things that needed adjustment. The short version is that I blocked
some configurations that won't work or would break an assertion.
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.
The adjustments were:
1. If you set your BLCKSZ or XLOG_BLCKSZ smaller than
PG_IO_ALIGN_SIZE, you shouldn't be allowed to turn on direct I/O for
the relevant operations, because such undersized direct I/Os will fail
on common systems.
FATAL: invalid value for parameter "io_direct": "wal"
DETAIL: io_direct is not supported for WAL because XLOG_BLCKSZ is too small
FATAL: invalid value for parameter "io_direct": "data"
DETAIL: io_direct is not supported for data because BLCKSZ is too small
In fact some systems would be OK with it if the true requirement is
512 not 4096, but (1) tiny blocks are a niche build option that
doesn't even pass regression tests and (2) it's hard and totally
unportable to find out the true requirement at runtime, and (3) the
conservative choice of 4096 has additional benefits by matching memory
pages. So I think a conservative compile-time number is a good
starting position.
2. Previously I had changed the WAL buffer alignment to be the larger
of PG_IO_ALIGN_SIZE and XLOG_BLCKSZ, but in light of the above
thinking, I reverted that part (no point in aligning the address of
the buffer when the size is too small for direct I/O, but now that
combination is blocked off at GUC level so we don't need any change
here).
3. I updated the md.c alignment assertions to allow for tiny blocks.
The point of these assertions is to fail if any new code does I/O from
badly aligned buffers even with io_direct turned off (ie how most
people hack), 'cause that will fail with io_direct turned on. The
change is that I don't make the assertion if you're using BLCKSZ <
PG_IO_ALIGN_SIZE. Such buffers wouldn't work if used for direct I/O
but that's OK, the GUC won't allow it.
4. I made the language to explain where PG_IO_ALIGN_SIZE really comes
from a little vaguer because it's complex.
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.
Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
Thomas Munro <thomas.munro@gmail.com> writes:
I did some testing with non-default block sizes, and found a few minor
things that needed adjustment.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2004%3A42%3A04
This seems like another thing that should not have been pushed mere
hours before feature freeze.
regards, tom lane
Hi,
On 2023-04-08 16:59:20 +1200, Thomas Munro wrote:
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
There were some failures in CI (e.g. [1]https://cirrus-ci.com/task/4519721039560704) and perhaps also bf (didn't yet
check), about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.
But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.
#4 0x00005624dfe90336 in errfinish (filename=0x5624df6867c0 "../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c", lineno=353,
funcname=0x5624df686900 <__func__.6> "StrategyGetBuffer") at ../../../../home/andres/src/postgresql/src/backend/utils/error/elog.c:604
#5 0x00005624dfc71dbe in StrategyGetBuffer (strategy=0x0, buf_state=0x7ffd4182137c, from_ring=0x7ffd4182137b)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c:353
#6 0x00005624dfc6a922 in GetVictimBuffer (strategy=0x0, io_context=IOCONTEXT_NORMAL)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1601
#7 0x00005624dfc6a29f in BufferAlloc (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, strategy=0x0, foundPtr=0x7ffd418214a3,
io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1290
#8 0x00005624dfc69c0c in ReadBuffer_common (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0,
hit=0x7ffd4182156b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1056
#9 0x00005624dfc69335 in ReadBufferExtended (reln=0x5624e1ee09f0, forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:776
#10 0x00005624df8eb78a in log_newpage_range (rel=0x5624e1ee09f0, forknum=MAIN_FORKNUM, startblk=0, endblk=45, page_std=false)
at ../../../../home/andres/src/postgresql/src/backend/access/transam/xloginsert.c:1290
#11 0x00005624df9567e7 in smgrDoPendingSyncs (isCommit=true, isParallelWorker=false)
at ../../../../home/andres/src/postgresql/src/backend/catalog/storage.c:837
#12 0x00005624df8d1dd2 in CommitTransaction () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:2225
#13 0x00005624df8d2da2 in CommitTransactionCommand () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:3060
#14 0x00005624dfcbe0a1 in finish_xact_command () at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:2779
#15 0x00005624dfcbb867 in exec_simple_query (query_string=0x5624e1eacd98 "create table t1 as select 1 as i from generate_series(1, 10000)")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1299
#16 0x00005624dfcc09c4 in PostgresMain (dbname=0x5624e1ee40e8 "postgres", username=0x5624e1e6c5f8 "andres")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4623
#17 0x00005624dfbecc03 in BackendRun (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4461
#18 0x00005624dfbec48e in BackendStartup (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4189
#19 0x00005624dfbe8541 in ServerLoop () at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1779
#20 0x00005624dfbe7e56 in PostmasterMain (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1463
#21 0x00005624dfad538b in main (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/main/main.c:200
If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.
Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.
It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?
Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?
Greetings,
Andres Freund
On Sat, Apr 8, 2023 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
No particular ideas about what happened there yet. It *looks* like we
wrote out a page, and then read it back in very soon afterwards, all
via the usual locked bufmgr/smgr pathways, and it failed basic page
validation. The reader was a parallel worker, because of the
debug_parallel_mode setting on that box. The page number looks
reasonable (I guess it's reading a page created by the UPDATE full of
new tuples, but I don't know which process wrote it). It's also not
immediately obvious how this could be connected to the 32 pinned
buffer problem (all of that would have happened in the CREATE TABLE
process, which had already ended before the UPDATE and SELECT
backends even started).
Andrew, what file system and type of disk is that machine using?
Hi,
On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
There were some failures in CI (e.g. [1]) and perhaps also bf (didn't yet
check), about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.
But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.
[backtrace]
If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.
Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.
It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?
Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?
Here's a quick prototype of this approach. If we expose LimitAdditionalPins(),
we'd probably want to add "Buffer" to the name, and pass it a relation, so
that it can hand off to LimitAdditionalLocalPins() when appropriate? The callsite
in question doesn't need it, but ...
Without the limiting of pins the modified 004_io_direct.pl fails 100% of the
time for me.
Presumably the reason it fails occasionally with 256kB of shared buffers
(i.e. NBuffers=32) is that autovacuum or checkpointer briefly pins a single
buffer. As log_newpage_range() thinks it can just pin 32 buffers
unconditionally, it fails in that case.
Greetings,
Andres Freund