Vectored I/O in bulk_write.c

Started by Thomas Munro · about 2 years ago · 16 messages · pgsql-hackers
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

I was trying to learn enough about the new bulk_write.c to figure out
what might be going wrong over at [1], and finished up doing this
exercise, which is experiment quality but passes basic tests. It's a
bit like v1-0013 and v1-0014's experimental vectored checkpointing
from [2] (which themselves are not currently proposed, that too was in
the experiment category), but this usage is a lot simpler and might be
worth considering. Presumably both things would eventually finish up
being done by (not yet proposed) streaming write, but could also be
done directly in this simple case.

This way, CREATE INDEX generates 128kB pwritev() calls instead of 8kB
pwrite() calls. (There's a magic 16 in there, we'd probably need to
think harder about that.) It'd be better if bulk_write.c's memory
management were improved: if buffers were mostly contiguous neighbours
instead of being separately palloc'd objects, you'd probably often get
128kB pwrite() instead of pwritev(), which might be marginally more
efficient.

This made me wonder why smgrwritev() and smgrextendv() shouldn't be
backed by the same implementation, since they are essentially the same
operation. The differences are some assertions, which might as well be
moved up to the smgr.c level as they must surely apply to any future
smgr implementation too (right?), and the segment file creation policy,
which can be controlled with a new argument. I tried that here. An
alternative would be for md.c to have a workhorse function that both
mdextendv() and mdwritev() call, but I'm not sure if there's much
point in that.

While thinking about that I realised that an existing write-or-extend
assertion in master is wrong because it doesn't add nblocks.

Hmm, it's a bit weird that we have nblocks as int or BlockNumber in
various places, which I think should probably be fixed.

[1]: /messages/by-id/CA+hUKGK+5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D+Zg@mail.gmail.com
[2]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

Attachments:

0001-Provide-vectored-variant-of-smgrextend.patch (text/x-patch, +40/-95)
0002-Use-vectored-I-O-for-bulk-writes.patch (text/x-patch, +58/-26)
#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#1)
Re: Vectored I/O in bulk_write.c

Slightly better version, adjusting a few obsoleted comments, adjusting
error message to distinguish write/extend, fixing a thinko in
smgr_cached_nblocks maintenance.

Attachments:

v2-0001-Provide-vectored-variant-of-smgrextend.patch (text/x-patch, +47/-107)
v2-0002-Use-vectored-I-O-for-bulk-writes.patch (text/x-patch, +58/-26)
#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#2)
Re: Vectored I/O in bulk_write.c

Here also is a first attempt at improving the memory allocation and
memory layout.

I wonder if bulk logging should trigger larger WAL writes in the "Have
to write it ourselves" case in AdvanceXLInsertBuffer(), since writing
8kB of WAL at a time seems like an unnecessarily degraded level of
performance, especially with wal_sync_method=open_datasync. Of course
the real answer is "make sure wal_buffers is high enough for your
workload" (usually indirectly by automatically scaling from
shared_buffers), but this problem jumps out when tracing bulk_write.c
with default settings. We write out the index 128kB at a time, but
the WAL 8kB at a time.

Attachments:

v3-0001-Provide-vectored-variant-of-smgrextend.patch (text/x-patch, +47/-107)
v3-0002-Use-vectored-I-O-for-bulk-writes.patch (text/x-patch, +58/-26)
v3-0003-Improve-bulk_write.c-memory-management.patch (text/x-patch, +43/-20)
#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#3)
Re: Vectored I/O in bulk_write.c

One more observation while I'm thinking about bulk_write.c... hmm, it
writes the data out and asks the checkpointer to fsync it, but doesn't
call smgrwriteback(). I assume that means that on Linux the physical
writeback sometimes won't happen until the checkpointer eventually
calls fsync() sequentially, one segment file at a time. I see that
it's difficult to decide how to do that though; unlike checkpoints,
which have rate control/spreading, bulk writes could more easily flood
the I/O subsystem in a burst. I expect most non-Linux systems'
write-behind heuristics to fire up for bulk sequential writes, but on
Linux where most users live, there is no write-behind heuristic AFAIK
(on the most common file systems anyway), so you have to crank that
handle if you want it to wake up and smell the coffee before it hits
internal limits, but then you have to decide how fast to crank it.

This problem will come into closer focus when we start talking about
streaming writes. For the current non-streaming bulk_write.c coding,
I don't have any particular idea of what to do about that, so I'm just
noting the observation here.

Sorry for the sudden wall of text/monologue; this is all a sort of
reaction to bulk_write.c that I should perhaps have sent to the
bulk_write.c thread, triggered by a couple of debugging sessions.

#5 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#4)
Re: Vectored I/O in bulk_write.c

(Replying to all your messages in this thread together)

This made me wonder why smgrwritev() and smgrextendv() shouldn't be
backed by the same implementation, since they are essentially the same
operation.

+1 to the general idea of merging the write and extend functions.

The differences are some assertions, which might as well be
moved up to the smgr.c level as they must surely apply to any future
smgr implementation too (right?), and the segment file creation policy,
which can be controlled with a new argument. I tried that here.

Currently, smgrwrite/smgrextend just calls through to mdwrite/mdextend.
I'd like to keep it that way. Otherwise we need to guess what a
hypothetical smgr implementation might look like.

For example this assertion:

/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum >= mdnblocks(reln, forknum));
#endif

I think that should continue to be md.c's internal affair. For example,
imagine that you had a syscall like write() but which always writes to
the end of the file and also returns the offset that the data was
written to. You would want to assert the returned offset instead of the
above.

So -1 on moving up the assertions to smgr.c.

Let's bite the bullet and merge the smgrwrite and smgrextend functions
at the smgr level too. I propose the following signature:

#define SWF_SKIP_FSYNC 0x01
#define SWF_EXTEND 0x02
#define SWF_ZERO 0x04

void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffer, BlockNumber nblocks,
int flags);

This would replace smgrwrite, smgrextend, and smgrzeroextend. The
mdwritev() function would have the same signature. A single 'flags' arg
looks better in the callers than booleans, because you don't need to
remember what 'true' or 'false' means. The callers would look like this:

/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, false) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, 0);

/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, SWF_SKIP_FSYNC);

/* like old smgrextend(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1,
SWF_EXTEND | SWF_SKIP_FSYNC);

/* like old smgrzeroextend(reln, MAIN_FORKNUM, 123, 10, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, NULL, 10,
SWF_EXTEND | SWF_ZERO | SWF_SKIP_FSYNC);

While thinking about that I realised that an existing write-or-extend
assertion in master is wrong because it doesn't add nblocks.

Hmm, it's a bit weird that we have nblocks as int or BlockNumber in
various places, which I think should probably be fixed.

+1

Here also is a first attempt at improving the memory allocation and
memory layout.
...
+typedef union BufferSlot
+{
+	PGIOAlignedBlock buffer;
+	dlist_node	freelist_node;
+}			BufferSlot;
+

If you allocated the buffers in one large contiguous chunk, you could
often do one large write() instead of a gathered writev() of multiple
blocks. That should be even better, although I don't know how much of a
difference it makes. The above layout wastes a fair amount of memory too,
because 'buffer' is I/O aligned.

I wonder if bulk logging should trigger larger WAL writes in the "Have
to write it ourselves" case in AdvanceXLInsertBuffer(), since writing
8kB of WAL at a time seems like an unnecessarily degraded level of
performance, especially with wal_sync_method=open_datasync. Of course
the real answer is "make sure wal_buffers is high enough for your
workload" (usually indirectly by automatically scaling from
shared_buffers), but this problem jumps out when tracing bulk_write.c
with default settings. We write out the index 128kB at a time, but
the WAL 8kB at a time.

Makes sense.

On 12/03/2024 23:38, Thomas Munro wrote:

One more observation while I'm thinking about bulk_write.c... hmm, it
writes the data out and asks the checkpointer to fsync it, but doesn't
call smgrwriteback(). I assume that means that on Linux the physical
writeback sometimes won't happen until the checkpointer eventually
calls fsync() sequentially, one segment file at a time. I see that
it's difficult to decide how to do that though; unlike checkpoints,
which have rate control/spreading, bulk writes could more easily flood
the I/O subsystem in a burst. I expect most non-Linux systems'
write-behind heuristics to fire up for bulk sequential writes, but on
Linux where most users live, there is no write-behind heuristic AFAIK
(on the most common file systems anyway), so you have to crank that
handle if you want it to wake up and smell the coffee before it hits
internal limits, but then you have to decide how fast to crank it.

This problem will come into closer focus when we start talking about
streaming writes. For the current non-streaming bulk_write.c coding,
I don't have any particular idea of what to do about that, so I'm just
noting the observation here.

It would be straightforward to call smgrwriteback() from smgr_bulk_flush()
every 1 GB of written data, for example. It would be nice to do it in the
background though, and not stall the bulk writing for it. With the AIO
patches, I presume we could easily start an asynchronous writeback and
not wait for it to finish.

Sorry for the sudden wall of text/monologue; this is all a sort of
reaction to bulk_write.c that I should perhaps have sent to the
bulk_write.c thread, triggered by a couple of debugging sessions.

Thanks for looking! This all makes sense. The idea of introducing the
bulk write interface was that now we have a natural place to put all
these heuristics and optimizations in. That seems to be a success,
AFAICS none of the things discussed here will change the bulk_write API,
just the implementation.

--
Heikki Linnakangas
Neon (https://neon.tech)

#6 Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: Vectored I/O in bulk_write.c

On Wed, Mar 13, 2024 at 9:57 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Let's bite the bullet and merge the smgrwrite and smgrextend functions
at the smgr level too. I propose the following signature:

#define SWF_SKIP_FSYNC 0x01
#define SWF_EXTEND 0x02
#define SWF_ZERO 0x04

void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffer, BlockNumber nblocks,
int flags);

This would replace smgrwrite, smgrextend, and smgrzeroextend. The

That sounds pretty good to me.

Here also is a first attempt at improving the memory allocation and
memory layout.
...
+typedef union BufferSlot
+{
+     PGIOAlignedBlock buffer;
+     dlist_node      freelist_node;
+}                    BufferSlot;
+

If you allocated the buffers in one large contiguous chunk, you could
often do one large write() instead of a gathered writev() of multiple
blocks. That should be even better, although I don't know how much of a
difference it makes. The above layout wastes a fair amount of memory too,
because 'buffer' is I/O aligned.

The patch I posted has an array of buffers with the properties you
describe, so you get a pwrite() (no 'v') sometimes, and a pwritev()
with a small iovcnt when it wraps around:

pwrite(...) = 131072 (0x20000)
pwritev(...,3,...) = 131072 (0x20000)
pwrite(...) = 131072 (0x20000)
pwritev(...,3,...) = 131072 (0x20000)
pwrite(...) = 131072 (0x20000)

Hmm, I expected pwrite() alternating with pwritev(iovcnt=2), the
latter for when it wraps around the buffer array, so I'm not sure why it's
3. I guess the btree code isn't writing them strictly monotonically or
something...

I don't believe it wastes any memory on padding (except a few bytes
wasted by palloc_aligned() before BulkWriteState):

(gdb) p &bulkstate->buffer_slots[0]
$4 = (BufferSlot *) 0x15c731cb4000
(gdb) p &bulkstate->buffer_slots[1]
$5 = (BufferSlot *) 0x15c731cb6000
(gdb) p sizeof(bulkstate->buffer_slots[0])
$6 = 8192

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#6)
Re: Vectored I/O in bulk_write.c

On 13/03/2024 12:18, Thomas Munro wrote:

On Wed, Mar 13, 2024 at 9:57 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here also is a first attempt at improving the memory allocation and
memory layout.
...
+typedef union BufferSlot
+{
+     PGIOAlignedBlock buffer;
+     dlist_node      freelist_node;
+}                    BufferSlot;
+

If you allocated the buffers in one large contiguous chunk, you could
often do one large write() instead of a gathered writev() of multiple
blocks. That should be even better, although I don't know how much of a
difference it makes. The above layout wastes a fair amount of memory too,
because 'buffer' is I/O aligned.

The patch I posted has an array of buffers with the properties you
describe, so you get a pwrite() (no 'v') sometimes, and a pwritev()
with a small iovcnt when it wraps around:

Oh I missed that it was "union BufferSlot". I thought it was a struct.
Never mind then.

--
Heikki Linnakangas
Neon (https://neon.tech)

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: Vectored I/O in bulk_write.c

Alright, here is a first attempt at merging all three interfaces as
you suggested. I like it! I especially like the way it removes lots
of duplication.

I don't understand your argument about the location of the
write-vs-extent assertions. It seems to me that these are assertions
about what the *public* smgrnblocks() function returns. In other
words, we assert that the caller is aware of the current relation size
(and has some kind of interlocking scheme for that to be possible),
according to the smgr implementation's public interface. That's not
an assertion about internal details of the smgr implementation, it's
part of the "contract" for the API.

Attachments:

v4-0001-Merge-smgrzeroextend-and-smgrextend-with-smgrwrit.patch (text/x-patch, +91/-195)
v4-0002-Use-vectored-I-O-for-bulk-writes.patch (text/x-patch, +65/-26)
v4-0003-Improve-bulk_write.c-memory-management.patch (text/x-patch, +43/-20)
#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#8)
Re: Vectored I/O in bulk_write.c

On 13/03/2024 23:12, Thomas Munro wrote:

Alright, here is a first attempt at merging all three interfaces as
you suggested. I like it! I especially like the way it removes lots
of duplication.

I don't understand your argument about the location of the
write-vs-extent assertions. It seems to me that these are assertions
about what the *public* smgrnblocks() function returns. In other
words, we assert that the caller is aware of the current relation size
(and has some kind of interlocking scheme for that to be possible),
according to the smgr implementation's public interface. That's not
an assertion about internal details of the smgr implementation, it's
part of the "contract" for the API.

I tried to say that smgr implementation might have better ways to assert
that than calling smgrnblocks(), so it would be better to leave it to
the implementation. But what bothered me most was that smgrwrite() had a
different signature than mdwrite(). I'm happy with the way you have it
in the v4 patch.

--
Heikki Linnakangas
Neon (https://neon.tech)

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Vectored I/O in bulk_write.c

On Thu, Mar 14, 2024 at 10:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I tried to say that smgr implementation might have better ways to assert
that than calling smgrnblocks(), so it would be better to leave it to
the implementation. But what bothered me most was that smgrwrite() had a
different signature than mdwrite(). I'm happy with the way you have it
in the v4 patch.

Looking for other things that can be hoisted up to smgr.c level
because they are part of the contract or would surely need to be
duplicated by any implementation: I think the check that you don't
exceed the maximum possible block number could be there too, no?
Here's a version like that.

Does anyone else want to object to doing this for 17? Probably still
needs some cleanup eg comments etc that may be lurking around the
place and another round of testing. As for the overall idea, I'll
leave it a few days and see if others have feedback. My take is that
this is what bulk_write.c was clearly intended to do, it's just that
smgr let it down by not allowing vectored extension yet. It's a
fairly mechanical simplification, generalisation, and net code
reduction to do so by merging, like this.

Attachments:

v5-0001-Use-smgrwritev-for-both-overwriting-and-extending.patch (text/x-patch, +110/-219)
v5-0002-Use-vectored-I-O-for-bulk-writes.patch (text/x-patch, +65/-26)
v5-0003-Improve-bulk_write.c-memory-management.patch (text/x-patch, +43/-20)
#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: Vectored I/O in bulk_write.c

I canvassed Andres off-list since smgrzeroextend() is his invention,
and he wondered if it was a good idea to blur the distinction between
the different zero-extension strategies like that. Good question. My
take is that it's fine:

mdzeroextend() already uses fallocate() only for nblocks > 8, but
otherwise writes out zeroes, because that was seen to interact better
with file system heuristics on common systems. That preserves current
behaviour for callers of plain-old smgrzeroextend(), which becomes a
wrapper for smgrwrite(..., 1, _ZERO | _EXTEND). If some hypothetical
future caller wants to be able to call smgrwritev(..., NULL, 9 blocks,
_ZERO | _EXTEND) directly without triggering the fallocate() strategy
for some well researched reason, we could add a new flag to say so, ie
adjust that gating.

In other words, we have already blurred the semantics. To me, it
seems nicer to have a single high level interface for the same logical
operation, and flags to select strategies more explicitly if that is
eventually necessary.

#12 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#11)
Re: Vectored I/O in bulk_write.c

Hi,

On 2024-03-16 12:27:15 +1300, Thomas Munro wrote:

I canvassed Andres off-list since smgrzeroextend() is his invention,
and he wondered if it was a good idea to blur the distinction between
the different zero-extension strategies like that. Good question. My
take is that it's fine:

mdzeroextend() already uses fallocate() only for nblocks > 8, but
otherwise writes out zeroes, because that was seen to interact better
with file system heuristics on common systems. That preserves current
behaviour for callers of plain-old smgrzeroextend(), which becomes a
wrapper for smgrwrite(..., 1, _ZERO | _EXTEND). If some hypothetical
future caller wants to be able to call smgrwritev(..., NULL, 9 blocks,
_ZERO | _EXTEND) directly without triggering the fallocate() strategy
for some well researched reason, we could add a new flag to say so, ie
adjust that gating.

In other words, we have already blurred the semantics. To me, it
seems nicer to have a single high level interface for the same logical
operation, and flags to select strategies more explicitly if that is
eventually necessary.

I don't think zeroextend on the one hand and a normal write or extend on
the other hand are really the same operation. In the former case the
content is hard-coded; in the latter it's caller-provided. Sure, we can
deal with that by special-casing NULL content - but at that point, what's
the benefit of combining the two operations? I'm not dead-set against
this, just not really convinced that it's a good idea to combine the
operations.

Greetings,

Andres Freund

#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#12)
Re: Vectored I/O in bulk_write.c

On Sun, Mar 17, 2024 at 8:10 AM Andres Freund <andres@anarazel.de> wrote:

I don't think zeroextend on the one hand and a normal write or extend on
the other hand are really the same operation. In the former case the
content is hard-coded; in the latter it's caller-provided. Sure, we can
deal with that by special-casing NULL content - but at that point, what's
the benefit of combining the two operations? I'm not dead-set against
this, just not really convinced that it's a good idea to combine the
operations.

I liked some things about that, but I'm happy to drop that part.
Here's a version that leaves smgrzeroextend() alone, and I also gave
up on moving errors and assertions up into the smgr layer for now to
minimise the change. So to summarise, this gives us smgrwritev(...,
flags), where flags can include _SKIP_FSYNC and _EXTEND. This way we
don't have to invent smgrextendv(). The old single block smgrextend()
still exists as a static inline wrapper.

Attachments:

v6-0001-Use-smgrwritev-for-both-overwriting-and-extending.patch (application/octet-stream, +90/-156)
v6-0002-Use-vectored-I-O-for-bulk-writes.patch (application/octet-stream, +73/-24)
v6-0003-Improve-bulk_write.c-memory-management.patch (application/octet-stream, +43/-20)
#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
Re: Vectored I/O in bulk_write.c

Then I would make the trivial change to respect the new
io_combine_limit GUC that I'm gearing up to commit in another thread.
As attached.

Attachments:

0001-fixup-respect-io_combine_limit-in-bulk_write.c.txt (text/plain, +3/-10)
#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#14)
Re: Vectored I/O in bulk_write.c

Here's a rebase. I decided against committing this for v17 in the
end. There's not much wrong with it AFAIK, except perhaps an
unprincipled chopping up of writes with large io_combine_limit due to
simplistic flow control, and I liked the idea of having a decent user
of smgrwritev() in the tree, and it probably makes CREATE INDEX a bit
faster, but... I'd like to try something more ambitious that
streamifies this and also the "main" writeback paths. I shared some
patches for that that are counterparts to this, over at [1].

[1]: /messages/by-id/CA+hUKGK1in4FiWtisXZ+Jo-cNSbWjmBcPww3w3DBM+whJTABXA@mail.gmail.com

Attachments:

v7-0001-Use-smgrwritev-for-both-overwriting-and-extending.patch (text/x-patch, +94/-160)
v7-0002-Use-vectored-I-O-in-CREATE-INDEX.patch (text/x-patch, +69/-25)
v7-0003-Use-contiguous-memory-for-bulk_write.c.patch (text/x-patch, +48/-20)
#16 Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#15)
Re: Vectored I/O in bulk_write.c

On Tue, Apr 09, 2024 at 04:51:52PM +1200, Thomas Munro wrote:

Here's a rebase. I decided against committing this for v17 in the
end. There's not much wrong with it AFAIK, except perhaps an
unprincipled chopping up of writes with large io_combine_limit due to
simplistic flow control, and I liked the idea of having a decent user
of smgrwritev() in the tree, and it probably makes CREATE INDEX a bit
faster, but... I'd like to try something more ambitious that
streamifies this and also the "main" writeback paths.

I see this in the CF as Needs Review since 2024-03-10, but this 2024-04-09
message sounds like you were abandoning it. Are you still commissioning a
review of this thread's patches, or is the CF entry obsolete?