Direct I/O
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
O_DIRECT asks the kernel to avoid caching file data as much as
possible. Here's a fun quote about it[2]:
"The exact semantics of Direct I/O (O_DIRECT) are not well specified.
It is not a part of POSIX, or SUS, or any other formal standards
specification. The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour has
generally been passed down as oral lore rather than as a formal set of
requirements and specifications."
It gives the kernel the opportunity to move data directly between
PostgreSQL's user space buffers and the storage hardware using DMA
hardware, that is, without CPU involvement or copying. Not all
storage stacks can do that, for various reasons, but even if not, the
caching policy should ideally still use temporary buffers and avoid
polluting the page cache.
These settings currently destroy performance, and are not intended to
be used by end-users, yet! That's why we filed them under
DEVELOPER_OPTIONS. You don't get automatic read-ahead, concurrency,
clustering or (of course) buffering from the kernel. The idea is that
later parts of the AIO patch-set will introduce mechanisms to replace
what the kernel is doing for us today, and then more, since we ought
to be even better at predicting our own future I/O than it, so that
we'll finish up ahead. Even with all that, you wouldn't want to turn
it on by default because the default shared_buffers would be
insufficient for any real system, and there are portability problems.
Examples of slowness:
* every 8KB sequential read or write becomes a full round trip to the
storage, one at a time
* data that is written to WAL and then read back in by WAL sender will
incur a full I/O round trip (that's probably not really an AIO problem,
that's something we should probably address by using shared memory
instead of files, as noted as a TODO item in the source code)
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
Assorted portability notes:
I expect this to "work" (that is, successfully destroy performance) on
typical developer systems running at least Linux, macOS, Windows and
FreeBSD. By work, I mean: not be rejected by PostgreSQL, not be
rejected by the kernel, and influence kernel cache behaviour on common
filesystems. It might be rejected with ENOTSUP, EINVAL etc. on some
more exotic filesystems and OSes. Of currently supported OSes, only
OpenBSD and Solaris don't have O_DIRECT at all, and we'll reject the
GUCs. For macOS and Windows we internally translate our own
PG_O_DIRECT flag to the correct flags/calls (committed a while
back[3]).
On Windows, scatter/gather is available only with direct I/O, so a
true pwritev would in theory be possible, but that has some more
complications and is left for later patches (probably using native
interfaces, not disguising as POSIX).
There may be systems on which 8KB offset alignment will not work at
all or not work well, and that's expected. For example, BTRFS, ZFS,
JFS "big file", UFS etc allow larger-than-8KB blocks/records, and an
8KB write will have to trigger a read-before-write. Note that
offset/length alignment requirements (blocks) are independent of
buffer alignment requirements (memory pages, 4KB).
The behaviour and cache coherency of files that have open descriptors
using both direct and non-direct flags may be complicated and vary
between systems. The patch currently lets you change the GUCs at
runtime so backends can disagree: that should probably not be allowed,
but is like that now for experimentation. More study is required.
If someone has a compiler that we don't know how to do
pg_attribute_aligned() for, then we can't make correctly aligned stack
buffers, so in that case direct I/O is disabled, but I don't know of
such a system (maybe aCC, but we dropped it). That's why smgr code
can only assert that pointers are IO-aligned if PG_O_DIRECT != 0, and
why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned()
macro, disabling the GUCs.
This seems to be an independent enough piece to get into the tree on
its own, with the proviso that it's not actually useful yet other than
for experimentation. Thoughts?
These patches have been hacked on at various times by Andres Freund,
David Rowley and me.
[1]: https://wiki.postgresql.org/wiki/AIO
[2]: https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3]: /messages/by-id/CA+hUKG+ADiyyHe0cun2wfT+SVnFVqNYPxoO6J9zcZkVO7+NGig@mail.gmail.com
Attachments:
0001-Align-PGAlignedBlock-to-expected-page-size.patch
0002-XXX-palloc_io_aligned-not-for-review-here.patch
0003-Add-direct-I-O-settings-developer-only.patch
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
AIO patch-set[1]. It adds three new settings, defaulting to off:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
--
Justin
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings. (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)
Interesting idea. So "direct_io = data, wal, wal_init", or maybe that
should be spelled io_direct. ("Direct I/O" is a common term of art,
but we also have some more io_XXX GUCs in later patches, so it's hard
to choose...)
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
Hi,
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.
In the past more complicated GUCs have not been well received, but it does
seem like a nice way to reduce the amount of redundant stuff.
Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?
(By "better", I mean reducing the number of new GUCs - which is less
important for developer GUCs anyway.)
FWIW, if / once we get to actual AIO, at least some of these would stop being
developer-only GUCs. There's substantial performance benefits in using DIO
with AIO. Buffered IO requires the CPU to copy the data from the userspace
into the kernelspace. But DIO can use DMA for that, freeing the CPU to do more
useful work. Buffered IO tops out much much earlier than AIO + DIO, and
unfortunately tops out at much lower speeds on server CPUs.
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Greetings,
Andres Freund
Hi,
On 2022-11-01 15:54:02 -0700, Andres Freund wrote:
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Although it might actually be worth aligning fully everywhere - there's a
noticeable performance difference for buffered read IO.
I benchmarked this on my workstation and laptop.
I mmap'ed a buffer with 2 MiB alignment, MAP_ANONYMOUS | MAP_HUGETLB, and then
measured performance of reading 8192 bytes into the buffer at different
offsets. Each time I copied 16GiB in total. Within a program invocation I
benchmarked each offset 4 times, threw away the worst measurement, and
averaged the rest. Then used the best of three program invocations.
workstation with dual xeon Gold 5215:
turbo on turbo off
offset GiB/s GiB/s
0 18.358 13.528
8 15.361 11.472
9 15.330 11.418
32 17.583 13.097
512 17.707 13.229
513 15.890 11.852
4096 18.176 13.568
8192 18.088 13.566
2MiB 18.658 13.496
laptop with i9-9880H:
turbo on turbo off
offset GiB/s GiB/s
0 33.589 17.160
8 28.045 14.301
9 27.582 14.318
32 31.797 16.711
512 32.215 16.810
513 28.864 14.932
4096 32.503 17.266
8192 32.871 17.277
2MiB 32.657 17.262
Seems pretty clear that using 4096 byte alignment is worth it.
Greetings,
Andres Freund
On 11/1/22 2:36 AM, Thomas Munro wrote:
Hi,
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
This is exciting to see! There are two other items to add to the TODO list
before this would be ready for production:
1) work_mem. This is a significant impediment to scaling shared buffers
the way you'd want to.
2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large
systems with 75% of memory going to shared_buffers, that becomes a very
significant problem, especially when the backend running the clock sweep
is doing so in order to perform an operation like a b-tree page split. I
suspect it shouldn't be too hard to deal with this issue by just having
bgwriter or another bgworker proactively ensuring some reasonable number
of buffers with usage_count=0 exist.
One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.
Hi,
On 2022-11-04 14:47:31 -0500, Jim Nasby wrote:
On 11/1/22 2:36 AM, Thomas Munro wrote:
Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the
This is exciting to see! There are two other items to add to the TODO list
before this would be ready for production:
1) work_mem. This is a significant impediment to scaling shared buffers the
way you'd want to.
I don't really think that's closely enough related to tackle together. Yes,
it'd be easier to set a large s_b if we had better work_mem management, but
it's a completely distinct problem, and in a lot of cases you could use DIO
without tackling the work_mem issue.
2) Clock sweep. Specifically, currently the only thing that drives
usage_count is individual backends running the clock hand. On large systems
with 75% of memory going to shared_buffers, that becomes a very significant
problem, especially when the backend running the clock sweep is doing so in
order to perform an operation like a b-tree page split. I suspect it
shouldn't be too hard to deal with this issue by just having bgwriter or
another bgworker proactively ensuring some reasonable number of buffers with
usage_count=0 exist.
I agree this isn't great, but I don't think the replacement efficiency is that
big a problem. Replacing the wrong buffers is a bigger issue.
I've run tests with s_b=768GB (IIRC) without it showing up as a major
issue. If you have an extreme replacement rate at such a large s_b you have a
lot of other problems.
I don't want to discourage anybody from tackling the clock replacement issues,
the contrary, but AIO+DIO can show significant wins without those
changes. It's already a humongous project...
One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try and apply DIO to SLRU files.
Which would be a very bad idea for now.... Thomas does have a patch for moving
them into the main buffer pool.
Greetings,
Andres Freund
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2022-11-10 14:26:20 +0700, John Naylor wrote:
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Memory alignment patches:
Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB pages, even when larger VM pages
are in use (if there is an exotic system where that isn't true, it
won't work). We need to deal with buffers on the stack, the heap and
in shmem. For the stack, see patch 0001. For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency. The main direct I/O patch is 0003.
One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
With one exception, there's only a small number of places that allocate pages
dynamically and we only do it for a small number of buffers. So I don't think
we should worry too much about this for now.
The one exception to this: GetLocalBufferStorage(). But it already batches
memory allocations by increasing sizes, so I think we're good as well.
Greetings,
Andres Freund
On Wed, Nov 2, 2022 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
io_data_direct = whether to use O_DIRECT for main data files
io_wal_direct = ... for WAL
io_wal_init_direct = ... for WAL-file initialisation
You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.
Done as io_direct=data,wal,wal_init. Thanks Justin, this is better.
I resisted the urge to invent a meaning for 'on' and 'off', mainly
because it's not clear what values 'on' should enable and it'd be
strange to have off without on, so for now an empty string means off.
I suppose the meaning of this string could evolve over time: the names
of forks, etc.
Perhaps we could use the guc assignment hook to transform the input value into
a bitmask?
Makes sense. The only tricky question was where to store the GUC. I
went for fd.c for now, but it doesn't seem quite right...
DIO is slower, but not so much that it can't run under CI. I suggest to
add an 099 commit to enable the feature during development.
Good idea, will do.
Done. The tests take 2-3x as long depending on the OS.
Might be worth to additionally have a short tap test that does some basic
stuff with DIO and leave that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.
Done.
Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
Oh, so BufFile is palloc'd and contains one of these. BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd. I will think about what to do about that...
It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?
Done. I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock.
You have to use the latter for SMgr, because I added alignment
assertions there. We might as well use it for any other I/O such as
frontend code too for a chance of a small performance boost as you
showed. For now I have not used PGIOAlignedBlock for BufFile, even
though it would be a great candidate for a potential speedup, only
because I am afraid of adding padding to every BufFile in scenarios
where we allocate many (could be avoided, a subject for separate
research).
V2 comprises:
0001 -- David's palloc_aligned() patch
https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
Attachments:
v2-0001-Add-allocator-support-for-larger-allocation-align.patch
v2-0002-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch
v2-0003-Add-io_direct-setting-developer-only.patch
v2-0004-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
David pushed the first as commit 439f6175, so here is a rebase of the
rest. I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O. In one place I had #ifdef
PG_O_DIRECT, but that's always defined, it's just that it's 0 on
Solaris and OpenBSD, and the check to reject the GUC wasn't quite
right on such systems.
Attachments:
v3-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch
v3-0002-Add-io_direct-setting-developer-only.patch
v3-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
On Thu, Dec 22, 2022 at 7:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
David pushed the first as commit 439f6175, so here is a rebase of the
rest. I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O. In one place I had #ifdef
PG_O_DIRECT, but that's always defined, it's just that it's 0 on
Solaris and OpenBSD, and the check to reject the GUC wasn't quite
right on such systems.
Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:
1. I think we don't need to overwrite the io_direct_string in
check_io_direct so that show_io_direct can be avoided.
2. check_io_direct can leak the flags memory - when io_direct is not
supported or for an invalid list syntax or an invalid option is
specified.
I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.
Some comments on the tests added:
1. Is there a way to know if Direct IO for WAL and data has been
picked up programmatically? IOW, can we know if the OS page cache is
bypassed? I know an external extension pgfincore which can help here,
but nothing in the core exists AFAICS.
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'),
"read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'),
"read back from local");
+$node->stop('immediate');
2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
3. A nitpick: Can we split these queries multi-line instead of in a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');
4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Review-comments-io_direct-GUC.txt
On Wed, Jan 25, 2023 at 8:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:
1. I think we don't need to overwrite the io_direct_string in
check_io_direct so that show_io_direct can be avoided.
Thanks for looking at this, and sorry for the late response. Yeah, agreed.
2. check_io_direct can leak the flags memory - when io_direct is not
supported or for an invalid list syntax or an invalid option is
specified.
I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.
Thanks. Your way is nicer. I merged your patch and added you as a co-author.
Some comments on the tests added:
1. Is there a way to know if Direct IO for WAL and data has been
picked up programmatically? IOW, can we know if the OS page cache is
bypassed? I know an external extension pgfincore which can help here,
but nothing in the core exists AFAICS.
Right, that extension can tell you how many pages are in the kernel
page cache which is quite interesting for this. I also once hacked up
something primitive to see *which* pages are in kernel cache, so I
could join that against pg_buffercache to measure double buffering,
when I was studying the "smile" shape where pgbench TPS goes down and
then back up again as you increase shared_buffers if the working set
is nearly as big as physical memory (code available in a link from
[1]).
Yeah, I agree it might be nice for human investigators to put
something like that in contrib/pg_buffercache, but I'm not sure you
could rely on it enough for an automated test, though, 'cause it
probably won't work on some file systems and the tests would probably
fail for random transient reasons (for example: some systems won't
kick pages out of kernel cache if they were already there, just
because we decided to open the file with O_DIRECT). (I got curious
about why mincore() wasn't standardised along with mmap() and all that
jazz; it seems the BSD and later Sun people who invented all those
interfaces didn't think that one was quite good enough[2], but every
(?) Unixoid OS copied it anyway, with variations... Apparently the
Windows thing is called VirtualQuery()).
2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O
OK, sure, done. And also oops, that was completely wrong and not
working the way I had it in that version...
3. A nitpick: Can we split these queries multi-line instead of in a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');
OK.
4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();
Sure, but why not?
Otherwise, I rebased, and made a couple more changes:
I found a line of the manual about wal_sync_method that needed to be removed:
- The <literal>open_</literal>* options also use
<literal>O_DIRECT</literal> if available.
In fact that sentence didn't correctly document the behaviour in
released branches (wal_level=minimal is also required for that, so
probably very few people ever used it). I think we should adjust that
misleading sentence in back-branches, separately from this patch set.
I also updated the commit message to highlight the only expected
user-visible change for this, namely the loss of the above incorrectly
documented obscure special case, which is replaced by the less obscure
new setting io_direct=wal, if someone still wants that behaviour.
Also a few minor comment changes.
[1]: https://twitter.com/MengTangmu/status/994770040745615361
[2]: http://kos.enix.org/pub/gingell8.pdf
Attachments:
v4-0001-Introduce-PG_IO_ALIGN_SIZE-and-align-all-I-O-buff.patch
v4-0002-Add-io_direct-setting-developer-only.patch
v4-0003-XXX-turn-on-direct-I-O-by-default-just-for-CI.patch
I did some testing with non-default block sizes, and found a few minor
things that needed adjustment. The short version is that I blocked
some configurations that won't work or would break an assertion.
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.
The adjustments were:
1. If you set your BLCKSZ or XLOG_BLCKSZ smaller than
PG_IO_ALIGN_SIZE, you shouldn't be allowed to turn on direct I/O for
the relevant operations, because such undersized direct I/Os will fail
on common systems.
FATAL: invalid value for parameter "io_direct": "wal"
DETAIL: io_direct is not supported for WAL because XLOG_BLCKSZ is too small
FATAL: invalid value for parameter "io_direct": "data"
DETAIL: io_direct is not supported for data because BLCKSZ is too small
In fact some systems would be OK with it if the true requirement is
512 not 4096, but (1) tiny blocks are a niche build option that
doesn't even pass regression tests and (2) it's hard and totally
unportable to find out the true requirement at runtime, and (3) the
conservative choice of 4096 has additional benefits by matching memory
pages. So I think a conservative compile-time number is a good
starting position.
2. Previously I had changed the WAL buffer alignment to be the larger
of PG_IO_ALIGN_SIZE and XLOG_BLCKSZ, but in light of the above
thinking, I reverted that part (no point in aligning the address of
the buffer when the size is too small for direct I/O, but now that
combination is blocked off at GUC level so we don't need any change
here).
3. I updated the md.c alignment assertions to allow for tiny blocks.
The point of these assertions is to fail if any new code does I/O from
badly aligned buffers even with io_direct turned off (ie how most
people hack), 'cause that will fail with io_direct turned on. The
change is that I don't make the assertion if you're using BLCKSZ <
PG_IO_ALIGN_SIZE. Such buffers wouldn't work if used for direct I/O
but that's OK, the GUC won't allow it.
4. I made the language to explain where PG_IO_ALIGN_SIZE really comes
from a little vaguer because it's complex.
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.
Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
Thomas Munro <thomas.munro@gmail.com> writes:
I did some testing with non-default block sizes, and found a few minor
things that needed adjustment.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2004%3A42%3A04
This seems like another thing that should not have been pushed mere
hours before feature freeze.
regards, tom lane
Hi,
On 2023-04-08 16:59:20 +1200, Thomas Munro wrote:
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
There were some failures in CI (e.g. [1]https://cirrus-ci.com/task/4519721039560704) and perhaps also bf (didn't yet
check), about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.
But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.
#4 0x00005624dfe90336 in errfinish (filename=0x5624df6867c0 "../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c", lineno=353,
funcname=0x5624df686900 <__func__.6> "StrategyGetBuffer") at ../../../../home/andres/src/postgresql/src/backend/utils/error/elog.c:604
#5 0x00005624dfc71dbe in StrategyGetBuffer (strategy=0x0, buf_state=0x7ffd4182137c, from_ring=0x7ffd4182137b)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c:353
#6 0x00005624dfc6a922 in GetVictimBuffer (strategy=0x0, io_context=IOCONTEXT_NORMAL)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1601
#7 0x00005624dfc6a29f in BufferAlloc (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, strategy=0x0, foundPtr=0x7ffd418214a3,
io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1290
#8 0x00005624dfc69c0c in ReadBuffer_common (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0,
hit=0x7ffd4182156b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1056
#9 0x00005624dfc69335 in ReadBufferExtended (reln=0x5624e1ee09f0, forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0)
at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:776
#10 0x00005624df8eb78a in log_newpage_range (rel=0x5624e1ee09f0, forknum=MAIN_FORKNUM, startblk=0, endblk=45, page_std=false)
at ../../../../home/andres/src/postgresql/src/backend/access/transam/xloginsert.c:1290
#11 0x00005624df9567e7 in smgrDoPendingSyncs (isCommit=true, isParallelWorker=false)
at ../../../../home/andres/src/postgresql/src/backend/catalog/storage.c:837
#12 0x00005624df8d1dd2 in CommitTransaction () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:2225
#13 0x00005624df8d2da2 in CommitTransactionCommand () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:3060
#14 0x00005624dfcbe0a1 in finish_xact_command () at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:2779
#15 0x00005624dfcbb867 in exec_simple_query (query_string=0x5624e1eacd98 "create table t1 as select 1 as i from generate_series(1, 10000)")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1299
#16 0x00005624dfcc09c4 in PostgresMain (dbname=0x5624e1ee40e8 "postgres", username=0x5624e1e6c5f8 "andres")
at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4623
#17 0x00005624dfbecc03 in BackendRun (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4461
#18 0x00005624dfbec48e in BackendStartup (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4189
#19 0x00005624dfbe8541 in ServerLoop () at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1779
#20 0x00005624dfbe7e56 in PostmasterMain (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1463
#21 0x00005624dfad538b in main (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/main/main.c:200
If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.
Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.
It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?
Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?
Greetings,
Andres Freund
On Sat, Apr 8, 2023 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this. I will now watch the
build farm. I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.Hmm. I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl. Let's see what else comes in.
No particular ideas about what happened there yet. It *looks* like we
wrote out a page, and then read it back in very soon afterwards, all
via the usual locked bufmgr/smgr pathways, and it failed basic page
validation. The reader was a parallel worker, because of the
debug_parallel_mode setting on that box. The page number looks
reasonable (I guess it's reading a page created by the UPDATE full of
new tuples, but I don't know which process wrote it). It's also not
immediately obvious how this could be connected to the 32 pinned
buffer problem (all of that would have happened in the CREATE TABLE
process, which had already ended before the UPDATE and SELECT
backends even started).
Andrew, what file system and type of disk is that machine using?
Hi,
On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
There were some failures in CI (e.g. [1]) and perhaps also bf (didn't yet
check), about "no unpinned buffers available". I was worried for a moment
that this could actually be related to the bulk extension patch.
But it looks like it's older - and not caused by direct_io support (except by
way of the test existing). I reproduced the issue locally by setting s_b even
lower, to 16 and made the ERROR a PANIC.
[backtrace]
If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.
Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.
It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?
Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?
Here's a quick prototype of this approach. If we expose LimitAdditionalPins(),
we'd probably want to add "Buffer" to the name, and pass it a relation, so
that it can hand off to LimitAdditionalLocalPins() when appropriate? The callsite
in question doesn't need it, but ...
Without the limiting of pins the modified 004_io_direct.pl fails 100% of the
time for me.
Presumably the reason it fails occasionally with 256kB of shared buffers
(i.e. NBuffers=32) is that autovacuum or checkpointer briefly pins a single
buffer. As log_newpage_range() thinks it can just pin 32 buffers
unconditionally, it fails in that case.
Greetings,
Andres Freund