Streaming I/O, vectored I/O (WIP)

Started by Thomas Munro over 2 years ago, 67 messages, pgsql-hackers
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that. This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream". That way,
the streaming infrastructure can look as far into the future as it
wants, and then:

* systematically issue POSIX_FADV_WILLNEED for random access,
replacing patchy ad hoc advice
* build larger vectored I/Os; eg one preadv() call can replace 16 pread() calls

That's more efficient, and it goes faster. It's better even on
systems without 'advice' and/or vectored I/O support, because some
I/Os can be merged into wider simple pread/pwrite calls, and various
other small efficiencies come from batching.

The real goal, though, is to make it easier for later work to replace
the I/O subsystem with true asynchronous and concurrent I/O, as
required to get decent performance with direct I/O (and, at a wild
guess, the magic network smgr replacements that many of our colleagues
on this list work on). Client code such as access methods wouldn't
need to change again to benefit from that, as it would be fully
insulated by the streaming abstraction.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model. That's the thing with
the most potential users throughout the source tree and AMs, and I've
attached some work-in-progress examples of half a dozen use cases.

=== 1. Vectored I/O through the layers ===

* Provide vectored variants of FileRead() and FileWrite().
* Provide vectored variants of smgrread() and smgrwrite().
* Provide vectored variant of ReadBuffer().
* Provide multi-block smgrprefetch().

=== 2. Streaming read API ===

* Give SMgrRelation pointers a well-defined lifetime.
* Provide basic streaming read API.

=== 3. Example users of streaming read API ===

* Use streaming reads in pg_prewarm. [TM]
* WIP: Use streaming reads in heapam scans. [AF]
* WIP: Use streaming reads in vacuum. [AF]
* WIP: Use streaming reads in nbtree vacuum scan. [AF]
* WIP: Use streaming reads in bitmap heapscan. [MP]
* WIP: Use streaming reads in recovery. [TM]

=== 4. Some less developed work on vectored writes ===

* WIP: Provide vectored variant of FlushBuffer().
* WIP: Use vectored writes in checkpointer.

All of these are WIP; those marked WIP above are double-WIP. But
there's enough to demo the concept and discuss. Here are some
assorted notes:

* probably need to split block-count and I/O-count in stats system?
* streaming needs to "ramp up", instead of going straight to big reads
* the buffer pin limit is somewhat primitive
* more study of buffer pool correctness required
* 16 block/128KB size limit is not exactly arbitrary but not well
researched (by me at least)
* various TODOs in user patches

A bit about where this code came from and how it relates to the "AIO"
project[1]: The idea and terminology 'streaming I/O' are due to
Andres Freund. This implementation of it is mine, and to keep this
mailing list fun, he hasn't reviewed it yet. The example user patches
are by Andres, Melanie Plageman and myself, and were cherry picked
from the AIO branch, where they originally ran on top of Andres's
truly asynchronous 'streaming read', which is completely different
code. It has (or will have) exactly the same API, but it does much
more, with much more infrastructure. But the AIO branch is far too
much to propose at once.

We might have been a little influenced by a recent discussion on
pgsql-performance[2] that I could summarise as "why do you guys need
to do all this fancy AIO stuff, just give me bigger reads!". That was
actually a bit of a special case, I think (something is wrong with
btrfs's prefetch heuristics?), but in conversation we realised that
converting parts of PostgreSQL over to a stream-oriented model could
be done independently of AIO, and could offer some nice incremental
benefits already. So I worked on producing this code with an
identical API that just maps on to old fashioned synchronous I/O
calls, except bigger and better.

The "example user" patches would be proposed separately in their own
threads after some more work, but I wanted to demonstrate the wide
applicability of this style of API in this preview. Some of these
make use of the ability to attach a bit of extra data to each buffer
-- see Melanie's bitmap heapscan patch, for example. In later
revisions I'll probably just pick one or two examples to work with for
a smaller core patch set, and then the rest can be developed
separately. (We thought about btree scans too as a nice high value
area to tackle, but Tomas Vondra is hacking in that area and we didn't
want to step on his toes.)

[1]: https://wiki.postgresql.org/wiki/AIO
[2]: /messages/by-id/218fa2e0-bc58-e469-35dd-c5cb35906064@gmx.net

Attachments:

v1-0011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch (+289 -638)
v1-0012-WIP-Use-streaming-reads-in-recovery.patch (+380 -345)
v1-0013-WIP-Provide-vectored-variant-of-FlushBuffer.patch (+181 -65)
v1-0014-WIP-Use-vector-writes-in-checkpointer.patch (+148 -24)
v1-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patch (+58 -12)
v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch (+304 -123)
v1-0003-Provide-vectored-variant-of-ReadBuffer.patch (+383 -186)
v1-0004-Provide-multi-block-smgrprefetch.patch (+33 -19)
v1-0005-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch (+46 -39)
v1-0006-Provide-basic-streaming-read-API.patch (+499 -3)
v1-0007-Use-streaming-reads-in-pg_prewarm.patch (+52 -2)
v1-0008-WIP-Use-streaming-reads-in-heapam-scans.patch (+207 -21)
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch (+244 -78)
v1-0010-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch (+67 -14)
#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#1)
Re: Streaming I/O, vectored I/O (WIP)

On 31/08/2023 07:00, Thomas Munro wrote:

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that. This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, as
evidenced by v1-0011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:

13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I
hoped that this would simplify it. I didn't look closely at that patch,
so maybe it's simpler even though it's more code.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.

All this makes sense. Some random comments on the patches:

+	/* Avoid a slightly more expensive kernel call if there is no benefit. */
+	if (iovcnt == 1)
+		returnCode = pg_pread(vfdP->fd,
+							  iov[0].iov_base,
+							  iov[0].iov_len,
+							  offset);
+	else
+		returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself?
pg_preadv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.

v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

/*
* Prepare to read a block. The buffer is pinned. If this is a 'hit', then
* the returned buffer can be used immediately. Otherwise, a physical read
* should be completed with CompleteReadBuffers(). PrepareReadBuffer()
* followed by CompleteReadBuffers() is equivalent to ReadBuffer(), but the
* caller has the opportunity to coalesce reads of neighboring blocks into one
* CompleteReadBuffers() call.
*
* *found is set to true for a hit, and false for a miss.
*
* *allocated is set to true for a miss that allocates a buffer for the first
* time. If there are multiple calls to PrepareReadBuffer() for the same
* block before CompleteReadBuffers() or ReadBuffer_common() finishes the
* read, then only the first such call will receive *allocated == true, which
* the caller might use to issue just one prefetch hint.
*/
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *found,
bool *allocated)

If you decide you don't want to perform the read, after all, is there a
way to abort it without calling CompleteReadBuffers()? Looking at the
later patch that introduces the streaming read API, seems that it
finishes all the reads, so I suppose we don't need an abort function.
Does it all get cleaned up correctly on error?

/*
* Convert an array of buffer address into an array of iovec objects, and
* return the number that were required. 'iov' must have enough space for up
* to PG_IOV_MAX elements.
*/
static int
buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)

The comment is a bit inaccurate. There's an assertion that nblocks
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for
up to PG_IOV_MAX elements, that's only because we also assume that
nblocks <= PG_IOV_MAX.

I don't see anything in the callers (mdreadv() and mdwritev()) to
prevent them from passing nblocks > PG_IOV_MAX.

in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the
WAL replay prefetching patch, and only after calling
pg_streaming_read_reset(). Could pg_streaming_read_reset() call
pg_streaming_read_prefetch() directly? Is there any need to "reset" the
stream, without also starting prefetching?

In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

@@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
* If the WAL record contains a block reference with the given ID, *rlocator,
* *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
* returns true.  Otherwise returns false.
+ *
+ * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
+ * of the pin is transferred to the caller.
*/
bool
XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
@@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
if (blknum)
*blknum = bkpb->blkno;
if (prefetch_buffer)
+	{
*prefetch_buffer = bkpb->prefetch_buffer;
+
+		/*
+		 * Clear this flag so that we can assert that redo records take
+		 * ownership of all buffers pinned by xlogprefetcher.c.
+		 */
+		bkpb->prefetch_buffer = InvalidBuffer;
+	}
return true;
}

Could these changes be committed independently of all the other changes?

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: Streaming I/O, vectored I/O (WIP)

Hi,

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:

I'm a bit disappointed and surprised by
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped
that this would simplify it. I didn't look closely at that patch, so maybe
it's simpler even though it's more code.

A good chunk of the changes is pretty boring stuff. A good chunk of the
remainder could be simplified a lot - it's partially there because vacuumlazy
changed a lot over the last couple years and because a bit more refactoring is
needed. I do think it's actually simpler in some ways - besides being more
efficient...

v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

I can't really imagine needing it anytime soon - due to the desire to avoid
ENOSPC for pages in the buffer pool the common pattern is to extend relations
with zeroes on disk, then populate those buffers in memory. It's possible that
you could use something like smgrextendv() when operating directly on the smgr
level - but then I suspect you're going to be better off to extend the
relation to the right size in one operation and then just use smgrwritev() to
write out the contents.

/*
* Prepare to read a block. The buffer is pinned. If this is a 'hit', then
* the returned buffer can be used immediately. Otherwise, a physical read
* should be completed with CompleteReadBuffers(). PrepareReadBuffer()
* followed by CompleteReadBuffers() is equivalent to ReadBuffer(), but the
* caller has the opportunity to coalesce reads of neighboring blocks into one
* CompleteReadBuffers() call.
*
* *found is set to true for a hit, and false for a miss.
*
* *allocated is set to true for a miss that allocates a buffer for the first
* time. If there are multiple calls to PrepareReadBuffer() for the same
* block before CompleteReadBuffers() or ReadBuffer_common() finishes the
* read, then only the first such call will receive *allocated == true, which
* the caller might use to issue just one prefetch hint.
*/
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *found,
bool *allocated)

If you decide you don't want to perform the read, after all, is there a way
to abort it without calling CompleteReadBuffers()?

When would that be needed?

Looking at the later patch that introduces the streaming read API, seems
that it finishes all the reads, so I suppose we don't need an abort
function. Does it all get cleaned up correctly on error?

I think it should. The buffer error handling is one of the areas where I
really would like to have some way of testing the various cases, it's easy to
get things wrong, and basically impossible to write reliable tests for with
our current infrastructure.

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better
idea.

extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL
replay prefetching patch, and only after calling pg_streaming_read_reset().
Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly?
Is there any need to "reset" the stream, without also starting prefetching?

Heh, I think this is a discussion Thomas and I were having before...

Greetings,

Andres Freund

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#3)
Re: Streaming I/O, vectored I/O (WIP)

On Thu, Sep 28, 2023 at 9:13 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:

Looking at the later patch that introduces the streaming read API, seems
that it finishes all the reads, so I suppose we don't need an abort
function. Does it all get cleaned up correctly on error?

I think it should. The buffer error handling is one of the areas where I
really would like to have some way of testing the various cases, it's easy to
get things wrong, and basically impossible to write reliable tests for with
our current infrastructure.

One thing to highlight is that this patch doesn't create a new state
in that case. In master, we already have the concept of a buffer with
BM_TAG_VALID but not BM_VALID and not BM_IO_IN_PROGRESS, reachable if
there is an I/O error. Eventually another reader will try the I/O
again, or the buffer will fall out of the pool. With this patch it's
the same, it's just a wider window: more kinds of errors might be
thrown in code between Prepare() and Complete() before we even have
BM_IO_IN_PROGRESS. So there is nothing extra to clean up. Right?

Yeah, it would be nice to test buffer pool logic directly. Perhaps
with a C unit test framework[1] and pluggable smgr[2] we could mock up
cases like I/O errors...

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better
idea.

Another idea I considered was that streams could be associated with a
single relation, but recovery could somehow manage a set of them.
From a certain point of view, that makes sense (we could be redoing
work that was created by multiple concurrent streams at 'do' time, and
with the approach shown here some clustering opportunities available
at do time are lost at redo time), but it's not at all clear that it's
worth the overheads or complexity, and I couldn't immediately figure
out how to do it. But I doubt there would ever be any other users of
a single stream with multiple relations, and I agree that this is
somehow not quite satisfying... Perhaps we should think about that
some more...

[1]: /messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com
[2]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Streaming I/O, vectored I/O (WIP)

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

+     /* Avoid a slightly more expensive kernel call if there is no benefit. */
+     if (iovcnt == 1)
+             returnCode = pg_pread(vfdP->fd,
+                                                       iov[0].iov_base,
+                                                       iov[0].iov_len,
+                                                       offset);
+     else
+             returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself?
pg_readv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.

Done. I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already! I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).

I don't see anything in the callers (mdreadv() and mdwritev()) to
prevent them from passing nblocks > PG_IOV_MAX.

The outer loop in md*v() copes with segment boundaries and also makes
sure lengthof(iov) AKA PG_IOV_MAX isn't exceeded (though that couldn't
happen with the current caller).

in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but in this version you can
choose, so recovery can opt for the wider callback.

I've added some ramp-up logic. The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet. Here is
how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072 <--- soon reading 16
blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072

I guess it could be done in quite a few different ways and I'm open to
better ideas. This way inserts prefetching stalls but ramps up
quickly and is soon out of the way. I wonder if we would want to make
that a policy that a caller can disable, if you want to skip the
ramp-up and go straight for the largest possible I/O size? Then I
think we'd need a 'flags' argument to the streaming read constructor
functions.

A small detour: While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem. One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls. We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access. But the logic can't see
that another process is making holes in this process's sequence. The
two obvious solutions are (1) pass in a flag at the start saying "I
promise this is sequential even if it doesn't look like it, no hints
please" and (2) invent "shared" (cross-process) streaming reads, and
teach all the parallel seq scan processes to get their buffers from
there.

Idea (2) is interesting to think about but even if it is a useful idea
(not sure) it is certainly overkill just to solve this little problem
for now. So perhaps I should implement (1), which would be another
reason to add a flags argument. It's not a perfect solution though
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.

I've included just the pg_prewarm example user for now while we
discuss the basic infrastructure. The rest are rebased and in my
public Github branch streaming-read (repo macdice/postgres) if anyone
is interested (don't mind the red CI failures, they're just saying I
ran out of monthly CI credits on the 29th, so close...)

Attachments:

v2-0004-Provide-vectored-variant-of-ReadBuffer.patch (+384 -184)
v2-0005-Provide-multi-block-smgrprefetch.patch (+33 -19)
v2-0006-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch (+46 -39)
v2-0007-Provide-basic-streaming-read-API.patch (+619 -3)
v2-0008-Use-streaming-reads-in-pg_prewarm.patch (+40 -2)
v2-0001-Optimize-pg_readv-pg_pwritev-for-single-vector.patch (+72 -125)
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patch (+48 -12)
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch (+304 -123)
#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#5)
Re: Streaming I/O, vectored I/O (WIP)

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

+     /* Avoid a slightly more expensive kernel call if there is no benefit. */
+     if (iovcnt == 1)
+             returnCode = pg_pread(vfdP->fd,
+                                                       iov[0].iov_base,
+                                                       iov[0].iov_len,
+                                                       offset);
+     else
+             returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself?
pg_preadv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.

Done. I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already! I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).

LGTM. I think this 0001 patch is ready for commit, independently of the
rest of the patches.

In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:

+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+

These seem out of place, we already have them in common/file_utils.h.
Other than that,
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
good to me.

--
Heikki Linnakangas
Neon (https://neon.tech)

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#5)
Re: Streaming I/O, vectored I/O (WIP)

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but in this version you can
choose, so recovery can opt for the wider callback.

Ok. Two APIs is a bit redundant, but because most callers would prefer
the simpler API, that's probably a good tradeoff.

I've added some ramp-up logic. The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet. Here is
how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072 <--- soon reading 16
blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072

I guess it could be done in quite a few different ways and I'm open to
better ideas. This way inserts prefetching stalls but ramps up
quickly and is soon out of the way. I wonder if we would want to make
that a policy that a caller can disable, if you want to skip the
ramp-up and go straight for the largest possible I/O size? Then I
think we'd need a 'flags' argument to the streaming read constructor
functions.

I think a 'flags' argument and a way to opt-out of the slow start would
make sense. pg_prewarm in particular knows that it will read the whole
relation.

A small detour: While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem. One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls. We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access. But the logic can't see
that another process is making holes in this process's sequence.

Hmm, aside from making the sequential pattern invisible to this process,
are we defeating Linux's logic too, just by performing the reads from
multiple processes? The processes might issue the reads to the kernel
out-of-order.

How bad is the slowdown when you issue WILLNEED hints on sequential access?

The two obvious solutions are (1) pass in a flag at the start saying
"I promise this is sequential even if it doesn't look like it, no
hints please" and (2) invent "shared" (cross-process) streaming
reads, and teach all the parallel seq scan processes to get their
buffers from there.

Idea (2) is interesting to think about but even if it is a useful idea
(not sure) it is certainly overkill just to solve this little problem
for now. So perhaps I should implement (1), which would be another
reason to add a flags argument. It's not a perfect solution though
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.

(1) seems fine to me.
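Combining the flags idea with the advice logic described above, option (1) might reduce to a check like the following sketch; the flag name and function are purely hypothetical, not anything in the patch set:

```c
#include <stdbool.h>

/* Hypothetical constructor flag: the caller promises the access
 * pattern is sequential, even if it doesn't look like it from this
 * process's point of view (e.g. parallel seq scan). */
#define STREAM_FLAG_SEQUENTIAL 0x01

static bool
should_issue_advice(int flags, bool looks_sequential)
{
    /* Never hint when the caller promises sequential access. */
    if (flags & STREAM_FLAG_SEQUENTIAL)
        return false;

    /* Otherwise hint only for access that looks random, since
     * WILLNEED hints get in the way of kernel readahead. */
    return !looks_sequential;
}
```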

--
Heikki Linnakangas
Neon (https://neon.tech)

#8Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#5)
Re: Streaming I/O, vectored I/O (WIP)

On Wed, Nov 29, 2023 at 01:17:19AM +1300, Thomas Munro wrote:

Thanks for posting a new version. I've included a review of 0004.

I've included just the pg_prewarm example user for now while we
discuss the basic infrastructure. The rest are rebased and in my
public Github branch streaming-read (repo macdice/postgres) if anyone
is interested (don't mind the red CI failures, they're just saying I
ran out of monthly CI credits on the 29th, so close...)

I agree it makes sense to commit the interface with just prewarm as a
user. Then we can start new threads for the various streaming read users
(e.g. vacuum, sequential scan, bitmapheapscan).

From db5de8ab5a1a804f41006239302fdce954cab331 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v2 4/8] Provide vectored variant of ReadBuffer().

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7c67d504c..8ae3a72053 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1046,175 +1048,326 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_FIRST;
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
}
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit,
+							   &allocated);
+
+	/* At this point we do NOT hold any locks. */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
+ * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the

ot -> to

+ * caller has the opportunity to coalesce reads of neighboring blocks into one
+ * CompleteReadBuffers() call.
+ *
+ * *found is set to true for a hit, and false for a miss.
+ *
+ * *allocated is set to true for a miss that allocates a buffer for the first
+ * time.  If there are multiple calls to PrepareReadBuffer() for the same
+ * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
+ * read, then only the first such call will receive *allocated == true, which
+ * the caller might use to issue just one prefetch hint.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *found,
+				  bool *allocated)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
}
else
{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
io_context = IOContextForStrategy(strategy);
io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;

You've lost this test in your new version. You can do the same thing
(avoid counting zeroed buffers as blocks read) by moving this
pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
where you know if you called ZeroBuffer() or CompleteReadBuffers().

}

-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, found, allocated);
+		if (*found)
+			pgBufferUsage.local_blks_hit++;
+		else
+			pgBufferUsage.local_blks_read++;

See comment above.

+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, found, allocated, io_context);
+		if (*found)
+			pgBufferUsage.shared_blks_hit++;
+		else
+			pgBufferUsage.shared_blks_read++;
+	}
+	if (bmr.rel)
+	{
+		pgstat_count_buffer_read(bmr.rel);

This is double-counting reads. You've left the call in
ReadBufferExtended() as well as adding this here. It should be fine to
remove it from ReadBufferExtended(). Because you test bmr.rel, leaving
the call here in PrepareReadBuffer() wouldn't have an effect on
ReadBuffer_common() callers who don't pass a relation (like recovery).
The other current callers of ReadBuffer_common() (by way of
ExtendBufferedRelTo()) who do pass a relation are visibility map and
freespace map extension, and I don't think we track relation stats for
the VM and FSM.

This does continue the practice of counting zeroed buffers as reads in
table-level stats. But, that is the same as master.

-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{

...

-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/* We found a buffer that we have to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1]))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time();
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);

I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
will be BLCKSZ and multiplying that by the number of reads should
produce the number of bytes read.

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 4efb34b75a..ee9307b612 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -116,7 +116,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
-				 bool *foundPtr)
+				 bool *foundPtr, bool *allocPtr)
{
BufferTag	newTag;			/* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));

*foundPtr = PinLocalBuffer(bufHdr, true);
+ *allocPtr = false;
}
else
{
@@ -170,6 +171,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

*foundPtr = false;
+ *allocPtr = true;
}

I would prefer you use consistent naming for
allocPtr/allocatedPtr/allocated. I also think that all the functions
taking it as an output argument should explain what it is
(BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of
digging around to figure it out. You have a nice comment about it above
PrepareReadBuffer(). I think you may need to resign yourself to
restating that bit (or some version of it) for all of the functions
taking it as an argument.

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41e26d3e20..e29ca85077 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,8 @@
#ifndef BUFMGR_H
#define BUFMGR_H

+#include "pgstat.h"

I don't know what we are supposed to do, but I would have included this
in bufmgr.c (where I actually needed it) instead of including it here.

+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -47,6 +49,8 @@ typedef enum
RBM_ZERO_AND_CLEANUP_LOCK, /* Like RBM_ZERO_AND_LOCK, but locks the page
* in "cleanup" mode */
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */

+	RBM_WILL_ZERO,				/* Don't read from disk, caller will call
+								 * ZeroBuffer() */

It's confusing that this (RBM_WILL_ZERO) is part of this commit since it
isn't used in this commit.

RBM_NORMAL_NO_LOG, /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
} ReadBufferMode;

- Melanie

#9Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#6)
Re: Streaming I/O, vectored I/O (WIP)

On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

LGTM. I think this 0001 patch is ready for commit, independently of the
rest of the patches.

Done.

In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:

+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+

These seem out of place, we already have them in common/file_utils.h.

Yeah, they moved from there in f39b2658 and I messed up the rebase. Fixed.

Other than that,
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
good to me.

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point). I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change. The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short. Thoughts?

[1]: https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften
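The "always set errno" approach amounts to something like this simplified sketch (not the actual fd.c code; the function name is invented for illustration):

```c
#include <errno.h>

/* Sketch: decide what errno should be after a possibly-short write.
 * If the kernel reported a short (or zero) transfer without setting
 * errno, guess ENOSPC so callers that report %m get something
 * sensible.  If the write was not short, callers never look at errno,
 * so clobbering it with a bogus value is harmless. */
static int
errno_after_write(long rc, int kernel_errno)
{
    if (rc >= 0 && kernel_errno == 0)
        return ENOSPC;
    return kernel_errno;
}
```

The point is that this rule needs only the return value and errno, not the total requested size, so FileWriteV() can apply it without summing the iovec array.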

Attachments:

v3-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patch (text/x-patch, +51 -13)
v3-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch (text/x-patch, +304 -123)
#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#9)
Re: Streaming I/O, vectored I/O (WIP)

On 29/11/2023 21:39, Thomas Munro wrote:

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?

If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point). I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change. The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short. Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

--
Heikki Linnakangas
Neon (https://neon.tech)

#11Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: Streaming I/O, vectored I/O (WIP)

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 29/11/2023 21:39, Thomas Munro wrote:

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there. If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point). I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change. The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short. Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

Cool. I would consider cleaning up all the callers and getting rid of
this ENOSPC stuff in independent work, but I didn't want discussion of
that (eg what external/extension code knows about this API?) to derail
THIS project, hence the desire to preserve existing behaviour.

#12Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#11)
Re: Streaming I/O, vectored I/O (WIP)

Hi,

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 29/11/2023 21:39, Thomas Munro wrote:

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there. If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Greetings,

Andres Freund

#13Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#12)
Re: Streaming I/O, vectored I/O (WIP)

On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Heikki, what do you think about this: we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?

Note that in file_utils.c we already have pg_pwritev_with_retry(),
which is clearly related to all this: that is a function that
guarantees to either complete the full pwritev() or throw an ERROR,
but leaves it undefined whether any data has been written on ERROR.
It has to add up the size too, and it adjusts the iovec array at the
same time, so it wouldn't use adjust_iovec_for_partial_transfer().
This is essentially the type of interface that I declined to put into
fd.c's FileWrite() and FileRead() because I feel like it doesn't fit
with the existing functions' primary business of adding vfd support to
well known basic I/O functions that return bytes transferred and set
errno. Perhaps someone might later want to introduce File*WithRetry()
wrappers or something if that proves useful? I wouldn't want them for
md.c though because I already know the size.
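The iovec adjustment being factored out here could look roughly like the following sketch; this is an illustration of the semantics described in the thread (adjusting in place after a partial transfer), not the committed compute_remaining_iovec(), and the name is invented:

```c
#include <stddef.h>
#include <sys/uio.h>

/* Sketch: given an iovec array and the number of bytes already
 * transferred, adjust the array in place so it describes only the
 * remaining data.  Returns the new iovec count (0 when everything
 * was transferred). */
static int
remaining_iovec(struct iovec *iov, int iovcnt, size_t transferred)
{
    int skip = 0;

    /* Skip over iovecs that were fully transferred. */
    while (skip < iovcnt && transferred >= iov[skip].iov_len)
    {
        transferred -= iov[skip].iov_len;
        skip++;
    }

    /* Trim the partially transferred iovec, if any. */
    if (skip < iovcnt && transferred > 0)
    {
        iov[skip].iov_base = (char *) iov[skip].iov_base + transferred;
        iov[skip].iov_len -= transferred;
    }

    /* Compact the remainder to the front of the array. */
    for (int j = 0; skip + j < iovcnt; j++)
        iov[j] = iov[skip + j];

    return iovcnt - skip;
}
```

A caller would loop: issue pwritev(), feed the short count into this helper, and retry with the returned prefix of the array until it returns 0.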

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#13)
Re: Streaming I/O, vectored I/O (WIP)

On 09/12/2023 02:41, Thomas Munro wrote:

On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Heikki, what do you think about this: we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?

Ok, works for me.

--
Heikki Linnakangas
Neon (https://neon.tech)

#15Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: Streaming I/O, vectored I/O (WIP)

On Sat, Dec 9, 2023 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ok, works for me.

I finished up making a few more improvements:

1. I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

2. FileReadV/FileWriteV patch:

* further simplification of the traditional ENOSPC 'guess'
* unconstify() changed to raw cast (pending [1])
* fixed the DO_DB()-wrapped debugging code

3. smgrreadv/smgrwritev patch:

* improved ENOSPC handling
* improve description of EOF and ENOSPC handling
* fixed the sizes reported in dtrace static probes
* fixed some words in the docs about that
* changed error messages to refer to "blocks %u..%u"

4. smgrprefetch-with-nblocks patch has no change, hasn't drawn any
comments hopefully because it is uncontroversial.

I'm planning to commit these fairly soon.

[1]: /messages/by-id/CA+hUKGK3OXFjkOyZiw-DgL2bUqk9by1uGuCnViJX786W+fyDSw@mail.gmail.com

Attachments:

v4-0001-Provide-helper-routine-for-partial-vector-I-O-ret.patch (+60 -26)
v4-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patch (+54 -22)
v4-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch (+280 -123)
v4-0004-Provide-multi-block-smgrprefetch.patch (+33 -19)
#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#15)
Re: Streaming I/O, vectored I/O (WIP)

On 11/12/2023 11:12, Thomas Munro wrote:

1. I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

In compute_remaining_iovec():

'source' and 'destination' may point to the same array, in which
case it is adjusted in-place; otherwise 'destination' must have enough
space for 'iovcnt' elements.

Is there any use case for not adjusting it in place?
pg_pwritev_with_retry() takes a const iovec array, but maybe just remove
the 'const' and document that it scribbles on it?

I'm planning to commit these fairly soon.

+1

--
Heikki Linnakangas
Neon (https://neon.tech)

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: Streaming I/O, vectored I/O (WIP)

On Mon, Dec 11, 2023 at 10:28 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 11/12/2023 11:12, Thomas Munro wrote:

1. I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

In compute_remaining_iovec():

'source' and 'destination' may point to the same array, in which
case it is adjusted in-place; otherwise 'destination' must have enough
space for 'iovcnt' elements.

Is there any use case for not adjusting it in place?
pg_pwritev_with_retry() takes a const iovec array, but maybe just remove
the 'const' and document that it scribbles on it?

I guess I just wanted to preserve pg_pwritev_with_retry()'s existing
prototype, primarily because it matches standard pwritev()/writev().
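The in-place vs. separate-destination distinction can be sketched like this (a hypothetical stand-in for the helper, not the committed code; the real compute_remaining_iovec() signature may differ). A caller holding a const iovec array, as pg_pwritev_with_retry() does, can pass a separate destination on the first partial transfer and adjust in place from then on, which is how the const prototype can be preserved:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Sketch of a compute_remaining_iovec()-style helper (illustrative, not
 * the committed PostgreSQL API).  Given an iovec array and the number of
 * bytes a partial preadv()/pwritev() transferred, produce the iovec
 * array describing what remains.  'source' and 'destination' may point
 * to the same array, in which case it is adjusted in place; otherwise
 * 'destination' must have room for 'iovcnt' elements.  Returns the
 * number of remaining iovecs (0 when the transfer is complete).
 */
static int
sketch_compute_remaining_iovec(struct iovec *destination,
                               const struct iovec *source,
                               int iovcnt, size_t transferred)
{
    /* Skip over iovecs that were fully transferred. */
    while (iovcnt > 0 && transferred >= source->iov_len)
    {
        transferred -= source->iov_len;
        source++;
        iovcnt--;
    }

    /* Copy the rest, trimming the first one if partially transferred. */
    if (iovcnt > 0)
    {
        memmove(destination, source, sizeof(struct iovec) * iovcnt);
        destination->iov_base = (char *) destination->iov_base + transferred;
        destination->iov_len -= transferred;
    }

    return iovcnt;
}
```

A retry loop would call this after each short transfer and reissue the syscall with the returned count until it reaches zero.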

#18Cédric Villemain
cedric.villemain+pgsql@abcsql.com
In reply to: Thomas Munro (#15)
Re: Streaming I/O, vectored I/O (WIP)

Le 11/12/2023 à 10:12, Thomas Munro a écrit :

3. smgrreadv/smgrwritev patch:

* improved ENOSPC handling
* improved description of EOF and ENOSPC handling
* fixed the sizes reported in dtrace static probes
* fixed some words in the docs about that
* changed error messages to refer to "blocks %u..%u"

4. smgrprefetch-with-nblocks patch is unchanged; it hasn't drawn any
comments, hopefully because it is uncontroversial.

I'm planning to commit these fairly soon.

Thanks, these are very useful additions.
I'm not sure what you already have lined up to come next...

I have 2 small patches here:
* to use range prefetch in pg_prewarm (smgrprefetch only at the moment,
using smgrreadv to come next).
* to support nblocks=0 in smgrprefetch (posix_fadvise supports a len=0
to apply flag from offset to end of file).
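That len=0 behaviour is specified by POSIX: for posix_fadvise(), a len of 0 means the advice applies to all data from the offset to the end of the file. A minimal demonstration (generic C, not PostgreSQL code):

```c
#include <fcntl.h>

/*
 * Apply POSIX_FADV_WILLNEED to an entire file with one call: POSIX
 * defines len == 0 as "from offset to end of file", so no file-size
 * lookup is needed.  Returns 0 on success, an errno value on failure.
 */
static int
advise_whole_file(int fd)
{
#ifdef POSIX_FADV_WILLNEED
    return posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
#else
    (void) fd;
    return 0;                   /* no-op where fadvise is unavailable */
#endif
}
```

This is what an nblocks=0 smgrprefetch() could map onto: one syscall covering the whole relation fork, rather than one advice call per block range.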

Should I add them to the commitfest?
---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

#19Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#5)
Re: Streaming I/O, vectored I/O (WIP)

I've written a new version of the vacuum streaming read user on top of
the rebased patch set [1]. It differs substantially from Andres' and
includes several refactoring patches that can apply on top of master.
As such, I've proposed those in a separate thread [2]. I noticed Mac
and Windows fail to build on CI for my branch with the streaming read
code. I haven't had a chance to investigate -- but I must have done
something wrong on rebase.

- Melanie

[1]: https://github.com/melanieplageman/postgres/tree/stepwise_vac_streaming_read
[2]: /messages/by-id/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA@mail.gmail.com

#20Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#8)
Re: Streaming I/O, vectored I/O (WIP)

On Wed, Nov 29, 2023 at 2:21 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Thanks for posting a new version. I've included a review of 0004.

Thanks! I committed the patches up as far as smgr.c before the
holidays. The next thing to commit would be the bufmgr.c one for
vectored ReadBuffer(), v5-0001. Here's my response to your review of
that, which triggered quite a few changes.

See also new version of the streaming_read.c patch, with change list
at end. (I'll talk about v5-0002, the SMgrRelation lifetime one, over
on the separate thread about that where Heikki posted a better
version.)

ot -> to

Fixed.

- if (found)
- pgBufferUsage.shared_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.shared_blks_read++;

You've lost this test in your new version. You can do the same thing
(avoid counting zeroed buffers as blocks read) by moving this
pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
where you know if you called ZeroBuffer() or CompleteReadBuffers().

Yeah, right.

After thinking about that some more, that's true and that placement
would be good, but only if we look just at this patch, where we are
chopping ReadBuffer() into two parts (PrepareReadBuffer() and
CompleteReadBuffers()), and then putting them back together again.
However, soon we'll want to use the two functions separately, and we
won't call ReadBuffer[_common]().

New idea: PrepareReadBuffer() can continue to be in charge of bumping
{local,shared}_blks_hit, but {local,shared}_blks_read++ can happen in
CompleteReadBuffers(). There is no counter for zeroed buffers, but if
there ever is in the future, it can go into ZeroBuffer().

In this version, I've moved that into CompleteReadBuffers(), along
with a new comment to explain a pre-existing deficiency in the whole
scheme: there is a race where you finish up counting a read but
someone else actually does the read, and also counts it. I'm trying
to preserve the existing bean counting logic to the extent possible
across this refactoring.

+     }
+     else
+     {
+             bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+                                                      strategy, found, allocated, io_context);
+             if (*found)
+                     pgBufferUsage.shared_blks_hit++;
+             else
+                     pgBufferUsage.shared_blks_read++;
+     }
+     if (bmr.rel)
+     {
+             pgstat_count_buffer_read(bmr.rel);

This is double-counting reads. You've left the call in
ReadBufferExtended() as well as adding this here. It should be fine to
remove it from ReadBufferExtended(). Because you test bmr.rel, leaving
the call here in PrepareReadBuffer() wouldn't have an effect on
ReadBuffer_common() callers who don't pass a relation (like recovery).
The other current callers of ReadBuffer_common() (by way of
ExtendBufferedRelTo()) who do pass a relation are visibility map and
freespace map extension, and I don't think we track relation stats for
the VM and FSM.

Oh yeah. Right. Fixed.

This does continue the practice of counting zeroed buffers as reads in
table-level stats. But, that is the same as master.

Right. It is a little strange that pgstat_count_buffer_read()
finishes up in a different location than
pgBufferUsage.{local,shared}_blks_read++, but that's precisely due to
this pre-existing difference in accounting policy. That generally
seems like POLA failure, so I've added a comment to help us remember
about that, for another day.

+             io_start = pgstat_prepare_io_time();
+             smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+             pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);

I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
will be BLCKSZ and multiplying that by the number of reads should
produce the number of bytes read.

OK, thanks, fixed.

BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
-                              bool *foundPtr)
+                              bool *foundPtr, bool *allocPtr)
{
BufferTag       newTag;                 /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));

*foundPtr = PinLocalBuffer(bufHdr, true);
+ *allocPtr = false;

...

I would prefer you use consistent naming for
allocPtr/allocatedPtr/allocated. I also think that all the functions
taking it as an output argument should explain what it is
(BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of
digging around to figure it out. You have a nice comment about it above
PrepareReadBuffer(). I think you may need to resign yourself to
restating that bit (or some version of it) for all of the functions
taking it as an argument.

I worked on addressing your complaints, but while doing so I had
second thoughts about even having that argument. The need for it came
up while working on streaming recovery. Without it, whenever there
were repeated references to the same block (as happens very often in
the WAL), it'd issue extra useless POSIX_FADV_WILLNEED syscalls, so I
wanted to find a way to distinguish the *first* miss for a given
block, to be able to issue the advice just once before the actual read
happens.
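The dedup problem can be sketched with a hypothetical helper (not PostgreSQL code; allocatedPtr solved this via buffer-table state instead): without some memory of recent requests, every repeated WAL reference to the same block would trigger another redundant POSIX_FADV_WILLNEED. A tiny direct-mapped table of recently advised block numbers filters the common back-to-back case:

```c
#include <stdbool.h>
#include <stdint.h>

#define RECENT_SLOTS 64

/*
 * Illustrative advice filter: remember the last block number advised per
 * hash slot, so repeated references to the same block issue advice only
 * on the first miss.  Collisions merely cause harmless re-advising.
 */
typedef struct AdviceFilter
{
    uint32_t recent[RECENT_SLOTS];
    bool     valid[RECENT_SLOTS];
} AdviceFilter;

/* Returns true if advice should be issued for this block (first miss). */
static bool
advice_filter_first_miss(AdviceFilter *f, uint32_t blocknum)
{
    uint32_t slot = blocknum % RECENT_SLOTS;

    if (f->valid[slot] && f->recent[slot] == blocknum)
        return false;           /* seen recently: skip redundant advice */
    f->recent[slot] = blocknum;
    f->valid[slot] = true;
    return true;
}
```

With IO_IN_PROGRESS-based background I/O, as hoped for PG 18, this bookkeeping becomes unnecessary, which is the argument for dropping allocatedPtr now.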

But now I see that other users of streaming reads probably won't
repeatedly stream the same block, and I am not proposing streaming
recovery for PG 17. Time to simplify. I decided to kick allocatedPtr
out to think about some more, but I really hope we'll be able to start
a real background I/O instead of issuing advice in PG 18 proposals,
and that'll be negotiated via IO_IN_PROGRESS, so then we'd never need
allocatedPtr.

Thus, removed.

#ifndef BUFMGR_H
#define BUFMGR_H

+#include "pgstat.h"

I don't know what we are supposed to do, but I would have included this
in bufmgr.c (where I actually needed it) instead of including it here.

Fixed.

+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -47,6 +49,8 @@ typedef enum
RBM_ZERO_AND_CLEANUP_LOCK, /* Like RBM_ZERO_AND_LOCK, but locks the page
* in "cleanup" mode */
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */

+     RBM_WILL_ZERO,                          /* Don't read from disk, caller will call
+                                                              * ZeroBuffer() */

It's confusing that this (RBM_WILL_ZERO) is part of this commit since it
isn't used in this commit.

Yeah. Removed.

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.

I now also have a much simplified version of the streaming read patch.
The short version is that it has all advanced features removed, so
that now it *only* does the clustering required to build up large
CompleteReadBuffers() calls. That's short and sweet, and enough for
pg_prewarm to demonstrate 128KB reads, a good first step.
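The clustering step can be sketched as follows (function and type names are hypothetical, not the patch's API): fold the stream of requested block numbers into contiguous (start, nblocks) ranges, capped so that each range fits one vectored read, e.g. 16 x 8KB blocks = 128KB:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGE_BLOCKS 16     /* 16 x 8KB blocks = one 128KB I/O */

typedef struct BlockRange
{
    uint32_t start;
    int      nblocks;
} BlockRange;

/*
 * Coalesce a stream of block numbers into contiguous ranges, each
 * serviceable by a single preadv() into consecutive buffers.  Returns
 * the number of ranges written to 'out'.
 */
static size_t
coalesce_blocks(const uint32_t *blocks, size_t n,
                BlockRange *out, size_t out_max)
{
    size_t nranges = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (nranges > 0 &&
            out[nranges - 1].start + out[nranges - 1].nblocks == blocks[i] &&
            out[nranges - 1].nblocks < MAX_RANGE_BLOCKS)
            out[nranges - 1].nblocks++;         /* extend current range */
        else if (nranges < out_max)
            out[nranges++] = (BlockRange) {blocks[i], 1};
        else
            break;              /* out of space; caller flushes and retries */
    }
    return nranges;
}
```

Each resulting range then becomes one CompleteReadBuffers() call spanning nblocks buffers, which is all the simplified patch needs to do.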

Then WILLNEED advice, and then ramp-up are added as separate patches,
for easier review. I've got some more patches in development that
would re-add "extended" multi-relation mode with wider callback that
can also stream zeroed buffers, as required so far only by recovery --
but I can propose that later.

Other improvements:

* Melanie didn't like the overloaded term "cluster" (off-list
feedback). Now I use "block range" to describe a range of contiguous
blocks (blocknum, nblocks).
* KK didn't like "prefetch" being used with various meanings (off-list
feedback). Better words found.
* pgsr_private might as well be void *; uintptr_t is theoretically
more general but doesn't seem to buy much for realistic use.
* per_io_data might as well be called per_buffer_data (it's not per
"I/O" and that isn't the level of abstraction for this API anyway
which is about blocks and buffers).
* I reordered some function arguments that jumped out after the above changes.
* Once I started down the path of using a flags field to control
various policy stuff as discussed up-thread, it started to seem
clearer that callers probably shouldn't directly control I/O depth,
which was all a bit ad-hoc and unfinished before. I think we'd likely
want that to be centralised. New idea: it should be enough to be able
to specify policies as required now and in future with flags.
Thoughts?

I've also included 4 of the WIP patches from earlier (based on an
obsolete version of the vacuum thing, sorry I know you have a much
better one now), which can be mostly ignored. Here I just wanted to
share working code to drive vectored reads with different access
patterns, and also show the API interactions and how they might each
set flag bits. For example, parallel seq scan uses
PGSR_FLAG_SEQUENTIAL to insist that its scans are sequential despite
appearances, pg_prewarm uses PGSR_FLAG_FULL to declare that it'll read
the whole relation so there is no point in ramping up, and the
vacuum-related stuff uses PGSR_FLAG_MAINTENANCE to select tuning based
on maintenance_io_concurrency instead of effective_io_concurrency.
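How such flags might translate into centralised policy can be sketched like this (the PGSR_FLAG_* names are from the patches; the depth-selection logic and hard-coded GUC values are assumptions for illustration only):

```c
#define PGSR_FLAG_SEQUENTIAL  0x01  /* caller asserts access is sequential */
#define PGSR_FLAG_FULL        0x02  /* whole relation: skip ramp-up */
#define PGSR_FLAG_MAINTENANCE 0x04  /* tune by maintenance_io_concurrency */

/* Stand-ins for the GUCs, at their default values. */
static int effective_io_concurrency = 1;
static int maintenance_io_concurrency = 10;

/*
 * Central policy: callers describe their access pattern with flags and
 * the streaming layer picks the advice depth, rather than each caller
 * hand-tuning I/O depth as in the earlier ad-hoc scheme.
 */
static int
streaming_read_io_depth(int flags)
{
    /* Sequential streams rely on OS readahead, not issued advice. */
    if (flags & PGSR_FLAG_SEQUENTIAL)
        return 1;
    return (flags & PGSR_FLAG_MAINTENANCE) ?
        maintenance_io_concurrency : effective_io_concurrency;
}
```

The point of the design is that adding a future policy (say, true background I/O) changes only this one place, not every access method.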

Attachments:

v5-0001-Provide-vectored-variant-of-ReadBuffer.patch
v5-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patch
v5-0003-Provide-API-for-streaming-reads-of-relations.patch
v5-0004-Use-streaming-reads-in-pg_prewarm.patch
v5-0005-Issue-advice-for-random-streaming-reads.patch
v5-0006-Allow-streaming-reads-to-ramp-up-in-size.patch
v5-0007-WIP-Use-streaming-reads-in-heapam-scans.patch
v5-0008-WIP-Use-streaming-reads-in-vacuum.patch
v5-0009-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patch
v5-0010-WIP-Use-streaming-reads-in-bitmap-heapscan.patch
#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#20)
#22Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#21)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#22)
#24Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#23)
#25Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Thomas Munro (#20)
#26Thomas Munro
thomas.munro@gmail.com
In reply to: Nazir Bilal Yavuz (#25)
#27Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#24)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#27)
#29Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#28)
#30Dilip Kumar
dilipbalaut@gmail.com
In reply to: Thomas Munro (#29)
#31Thomas Munro
thomas.munro@gmail.com
In reply to: Dilip Kumar (#30)
#32Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#31)
#33Dilip Kumar
dilipbalaut@gmail.com
In reply to: Thomas Munro (#31)
#34Thomas Munro
thomas.munro@gmail.com
In reply to: Dilip Kumar (#33)
#35Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#34)
#36Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Thomas Munro (#35)
#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#34)
#38Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#37)
#39Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#38)
#40Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#39)
#41Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#38)
#42Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#41)
#43Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#42)
#44Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#43)
#45Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#44)
#46Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#38)
#47Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#46)
#48Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#45)
#49Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#48)
#50Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#49)
#51Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#50)
#52Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#51)
#53Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Thomas Munro (#52)
#54Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#53)
#55Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#54)
#56Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#55)
#57Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#56)
#58Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Thomas Munro (#55)
#59Melanie Plageman
melanieplageman@gmail.com
In reply to: Nazir Bilal Yavuz (#58)
#60Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#58)
#61Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#60)
#62David Rowley
dgrowleyml@gmail.com
In reply to: Nazir Bilal Yavuz (#61)
#63David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#62)
#64Thomas Munro
thomas.munro@gmail.com
In reply to: David Rowley (#63)
#65Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#5)
#66Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Heikki Linnakangas (#41)
#67Bruce Momjian
bruce@momjian.us
In reply to: Nazir Bilal Yavuz (#66)