Some read stream improvements
Hi,
Here are some patches that address some of Andres's feedback since the
AIO v2 rebase[1], anticipate out-of-order streams, and make some other
minor improvements. They are independent of the main AIO patch set
and apply to master, hence separate thread.
0001-Refactor-read_stream.c-s-circular-arithmetic.patch
This just replaced open-coded arithmetic with inline functions. They
will be used a lot more in later work, and provide central places to
put assertions that were not checked as uniformly as I would like.
0002-Allow-more-buffers-for-sequential-read-streams.patch
Currently we're only generating I/O concurrency for random I/O, and I
originally guesstimated that random I/O wouldn't typically be able to
combine more than about 4 blocks on average when capping the buffer
queue. Debatable perhaps, but soon obsolete: for true asynchronous
I/O, we'll need full concurrency * full combine limit, so we'll
potentially want to pin more buffers. This just changes that
hard-coded 4 to io_combine_limit.
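To make the sizing arithmetic concrete, here is a hedged sketch (helper names invented for illustration; the real constants and queue sizing live in read_stream.c): the old cap guessed about 4 combinable blocks per in-flight random I/O, while true asynchronous I/O may need every in-flight read at the full combine limit.

```c
#include <assert.h>

/* Illustrative only: pinned-buffer headroom needed to keep max_ios
 * reads in flight.  The old cap assumed ~4 combined blocks per random
 * I/O; for asynchronous I/O each read may reach io_combine_limit. */
static int
queue_cap_old(int max_ios)
{
	return max_ios * 4;			/* hard-coded guesstimate */
}

static int
queue_cap_new(int max_ios, int io_combine_limit)
{
	return max_ios * io_combine_limit;
}
```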
0003-Improve-buffer-pool-API-for-per-backend-pin-limits.patch
In preparation for the next patch, provide some new bufmgr APIs.
0004-Respect-pin-limits-accurately-in-read_stream.c.patch
The current coding only computes the remaining "fair share" of the
buffer pool for this backend at stream initialisation. It's hard, but
not impossible, to get one backend to pin more than 1/max_connections
of the buffer pool (the intended cap), when using multiple streams at
the same time in one backend. This patch moves the time of check to
the time of use, so it respects the limit strictly. I avoid adding
any changes to the fast path for all-cached streams, which only pin
one buffer at a time.
This is basically a TOCTOU bug, and could in theory be back-patched,
but I don't personally think it's necessary yet, since it's so hard to
actually break anything with v17's rather limited use of streams. The
only way I can think of right now to see a concrete problem would be
to get many backends all to create many CURSORs that stream cold data,
and FETCH in a particular order only after they've all been
constructed, which is also a recipe for running out of buffers without
using streams, albeit not quite as quickly.
0005-Support-buffer-forwarding-in-read_stream.c.patch
0006-Support-buffer-forwarding-in-StartReadBuffers.patch
Background: StartReadBuffers() reports a short read when it finds a
cached block that divides a potential I/O. For example, if you ask to
read 8 blocks, and the 6th one turns out to be a hit, ie already valid
in the buffer pool, then we only need to read the first 5 from disk.
Perhaps the 7th and 8th blocks will also need I/O, but we don't want
StartReadBuffers() to start *two* I/Os, so the 6th block terminates
our search. The question is what to do with that 6th block.
Currently we'll tell the user that the size of the operation is 6
blocks -- that is, we'll silently include that hit that prevented
further combining, even though we'll only actually do the read for the
first 5 blocks, and if someone else manages to make any buffers valid
before you call WaitReadBuffers(), we'll just silently skip over them
and loop.
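As a toy model of that rule (not the bufmgr implementation; combine_until_hit is an invented name): the search for combinable blocks stops at the first cache hit, and the reported *nblocks silently includes that terminating hit.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: given which blocks of the requested range are already
 * valid in the buffer pool, return how many leading blocks the single
 * permitted I/O covers, and report via *nblocks the I/O length plus
 * the terminating hit (if any), mirroring the behaviour described
 * above.
 */
static int
combine_until_hit(const bool *cached, int requested, int *nblocks)
{
	int		io_len = 0;

	while (io_len < requested && !cached[io_len])
		io_len++;

	/* Silently include the hit that stopped further combining. */
	*nblocks = io_len + (io_len < requested ? 1 : 0);
	return io_len;
}
```

For the example above: 8 requested blocks with a hit at the 6th give an I/O of 5 blocks but *nblocks = 6.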
That was simple enough, but having hits that are invisibly mixed up
with misses prevents us from implementing fully general reordering of
hits, an important optimisation that reduces I/O stalls for consumers
that can cope with out-of-order data (I'll post a new patch for that
separately).
A simple solution we rejected was to unpin such buffers and then repin
later, but that would waste cycles, allow eviction and potentially
mess up the usage count. The AIO patch set also introduces another
case involving potentially many buffers: it moves the
BM_IO_IN_PROGRESS negotiation into StartReadBuffers(), and reports a
short read then too for middle blocks, but it has already acquired all
the pins.
The solution we agreed on is to introduce a way for StartReadBuffers()
to communicate with future calls, and "forward" pinned buffers between
calls. The function arguments don't change, but its "buffers"
argument becomes an in/out array: one StartReadBuffers() call can
leave extra pinned buffers after the ones that were included in a
short read (*nblocks), and then when you retry (or possibly extend)
the rest of the read, you have to pass them back in. That is easy for
the read stream code, as it can just leave them in its circular queue
for the next call to take as input. It only needs to be aware of them
for pin limit accounting and stream reset (including early shutdown).
One goal was to avoid introducing any new instructions into
ReadBuffer() or the read stream fast path, so StartReadBuffer(), the
single-block specialisation, doesn't support forwarding (it already
can't split reads, they are only one block, but it also doesn't
support receiving forwarded buffers).
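Here is a minimal sketch of that forwarding contract, with invented names (toy_start_read is not the real StartReadBuffers, and the real function pins buffers through the buffer manager): extra pinned buffers are left in the array after *nblocks, and the next call treats its first nforwarded entries as already pinned.

```c
#include <assert.h>

static int	pins_held;			/* models this backend's pin count */

/*
 * Toy model of a "start read" call with forwarding.  buffers[] is
 * in/out: entries [0, nforwarded) arrive already pinned from the
 * previous call; on return, entries at index *nblocks and beyond hold
 * buffers that were pinned but not included in the read.  Returns the
 * number of forwarded buffers.  hit_at < 0 means no cache hit splits
 * the read.
 */
static int
toy_start_read(int *buffers, int *nblocks, int nforwarded, int hit_at)
{
	for (int i = 0; i < *nblocks; i++)
	{
		if (i >= nforwarded)
		{
			pins_held++;			/* pin a newly acquired buffer */
			buffers[i] = 1000 + i;	/* fake buffer ID */
		}
	}

	/*
	 * A hit in the middle causes a short read; the hit and anything
	 * pinned after it stay pinned and are forwarded to the next call.
	 */
	if (hit_at >= 0 && hit_at < *nblocks)
	{
		int		forwarded = *nblocks - hit_at;

		*nblocks = hit_at;
		return forwarded;
	}
	return 0;
}
```

A short read of 8 blocks with a hit at index 5 reports *nblocks = 5 and forwards 3 pinned buffers; passing them back in means the retry acquires no new pins, which is what keeps the pin accounting honest.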
I plan to share some new C-level tests and illustrations of the
internal states of read_stream.c separately, as the complexity is
obviously increasing a bit (especially with out-of-order streams,
about which more soon).
[1]: /messages/by-id/CA+hUKGLxH1tsUgzZfng4BU6GqnS6bKF2ThvxH1_w5c7-sLRKQw@mail.gmail.com
Attachments:
v1-0001-Refactor-read_stream.c-s-circular-arithmetic.patch (text/x-patch, +61 -18)
v1-0002-Allow-more-buffers-for-sequential-read-streams.patch (text/x-patch, +2 -3)
v1-0003-Improve-buffer-pool-API-for-per-backend-pin-limit.patch (text/x-patch, +77 -19)
v1-0004-Respect-pin-limits-accurately-in-read_stream.c.patch (text/x-patch, +95 -17)
v1-0005-Support-buffer-forwarding-in-read_stream.c.patch (text/x-patch, +94 -15)
v1-0006-Support-buffer-forwarding-in-StartReadBuffers.patch (text/x-patch, +91 -39)
On Mon, 17 Feb 2025 at 09:55, Thomas Munro <thomas.munro@gmail.com> wrote:
Hi,
Here are some patches that address some of Andres's feedback since the
AIO v2 rebase[1], anticipate out-of-order streams, and make some other
minor improvements. They are independent of the main AIO patch set
and apply to master, hence separate thread.
Hi, great!
0001-Refactor-read_stream.c-s-circular-arithmetic.patch
This just replaced open-coded arithmetic with inline functions. They
will be used a lot more in later work, and provide central places to
put assertions that were not checked as uniformly as I would like.
Just out of curiosity, should we `Assert(*index + n <
stream->queue_size);` in `read_stream_index_advance_n`?
--
Best regards,
Kirill Reshke
On Mon, Feb 17, 2025 at 5:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
The solution we agreed on is to introduce a way for StartReadBuffers()
to communicate with future calls, and "forward" pinned buffers between
calls. The function arguments don't change, but its "buffers"
argument becomes an in/out array: one StartReadBuffers() call can
leave extra pinned buffers after the ones that were included in a
short read (*nblocks), and then when you retry (or possibly extend)
the rest of the read, you have to pass them back in. That is easy for
the read stream code, as it can just leave them in its circular queue
for the next call to take as input. It only needs to be aware of them
for pin limit accounting and stream reset (including early shutdown).
BTW here's a small historical footnote about that: I had another
solution to the same problem in the original stream proposal[1], which
used a three-step bufmgr API. You had to call PrepareReadBuffer() for
each block you intended to read, and then StartReadBuffers() for each
cluster of adjacent misses it reported, and finally WaitReadBuffers().
Hits would require only the first step. That was intended to allow
the stream to manage the prepared buffers itself, short reads would
leave some of them for a later call. Based on review feedback, I
simplified to arrive at the two-step API, ie just start and wait, and
PrepareReadBuffer() became the private helper function
PinBufferForBlock(). I had to invent the "hide-the-trailing-hit"
trick to avoid unpin/repin sequence in the two-step API, thinking we
might eventually want to consider the three-step API again later.
This new design achieves the same end result: buffers that can't be
part of an I/O stay pinned and in the stream's queue ready for the
next call, just like "prepared" buffers in the early prototypes, but
it keeps the two-step bufmgr API and calls them "forwarded".
[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
On Mon, Feb 17, 2025 at 6:55 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
Just out of curiosity, should we `Assert(*index + n <
stream->queue_size);` in `read_stream_index_advance_n`?
No: it is allowed to be >= queue_size temporarily, but if so we
subtract queue_size. The result should be equal to (index + n) %
queue_size, assuming small values of n, except we don't want to use %
in hot code. Perhaps we should assert that though!
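The modulo-free wrap-around described above can be sketched like this (a standalone illustration, not the actual read_stream.c helper):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Advance a circular queue index by n without the modulo operator.
 * Valid only for n <= queue_size: the intermediate sum then exceeds
 * queue_size by less than one full lap, so a single subtraction is
 * equivalent to (index + n) % queue_size.
 */
static inline int16_t
index_advance_n(int16_t index, int16_t n, int16_t queue_size)
{
	assert(n <= queue_size);
	index += n;
	if (index >= queue_size)
		index -= queue_size;
	assert(index >= 0 && index < queue_size);
	return index;
}
```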
Hi,
On 2025-02-17 17:55:09 +1300, Thomas Munro wrote:
0004-Respect-pin-limits-accurately-in-read_stream.c.patch
The current coding only computes the remaining "fair share" of the
buffer pool for this backend at stream initialisation. It's hard, but
not impossible, to get one backend to pin more than 1/max_connections
of the buffer pool (the intended cap), when using multiple streams at
the same time in one backend. This patch moves the time of check to
the time of use, so it respects the limit strictly. I avoid adding
any changes to the fast path for all-cached streams, which only pin
one buffer at a time.
I was working on expanding tests for AIO and as part of that wrote a test for
temp tables -- our coverage is fairly awful, there were many times during AIO
development where I knew I had trivially reachable temp table specific bugs
but all tests passed.
The test for that does trigger the problem described above and is fixed by the
patches in this thread (which I included in the other thread):
/messages/by-id/knr4aazlaa4nj3xnpe4tu6plwayovzxhmteatcpry2j6a6kc4v@aonkl53s2ecs
Just linked instead of attached to not trigger cfbot.
Greetings,
Andres Freund
On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
I was working on expanding tests for AIO and as part of that wrote a test for
temp tables -- our coverage is fairly awful, there were many times during AIO
development where I knew I had trivially reachable temp table specific bugs
but all tests passed.
The test for that does trigger the problem described above and is fixed by the
patches in this thread (which I included in the other thread):
Thanks. Alright, I'm assuming that you don't have any objections to
the way I restyled that API, so I'm going to go ahead and push some of
these shortly, and then follow up with a few newer patches that
simplify and improve the look-ahead and advice control. More very
soon.
On Thu, Feb 27, 2025 at 11:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
I was working on expanding tests for AIO and as part of that wrote a test for
temp tables -- our coverage is fairly awful, there were many times during AIO
development where I knew I had trivially reachable temp table specific bugs
but all tests passed.
The test for that does trigger the problem described above and is fixed by the
patches in this thread (which I included in the other thread):
Thanks. Alright, I'm assuming that you don't have any objections to
the way I restyled that API, so I'm going to go ahead and push some of
these shortly, and then follow up with a few newer patches that
simplify and improve the look-ahead and advice control. More very
soon.
Ugh, I realised in another round of self-review that that version
could exceed the soft limit by a small amount if the registered
callback pins more buffers underneath it, so not pushed yet. I think
I see how to fix that (namely the alternative design that a comment
already contemplated), more soon...
On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
I was working on expanding tests for AIO and as part of that wrote a test for
temp tables -- our coverage is fairly awful, there were many times during AIO
development where I knew I had trivially reachable temp table specific bugs
but all tests passed.
The test for that does trigger the problem described above and is fixed by the
patches in this thread (which I included in the other thread):
Thanks. Alright, I'm assuming that you don't have any objections to
the way I restyled that API, so I'm going to go ahead and push some of
these shortly, and then follow up with a few newer patches that
simplify and improve the look-ahead and advice control. More very
soon.
Indeed, no objections, rather the opposite. Thanks!
Greetings,
Andres Freund
On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
I was working on expanding tests for AIO and as part of that wrote a test for
temp tables -- our coverage is fairly awful, there were many times during AIO
development where I knew I had trivially reachable temp table specific bugs
but all tests passed.
The test for that does trigger the problem described above and is fixed by the
patches in this thread (which I included in the other thread):
Here is a subset of those patches again:
1. Per-backend buffer limit, take III. Now the check is in
read_stream_start_pending_read() so TOC == TOU.
Annoyingly, test cases like the one below still fail, despite
following the rules. The other streams eat all the buffers and then
one gets an allowance of zero, but uses its right to take one pin
anyway to make progress, and there isn't one. I wonder if we should
use temp_buffers - 100? Then leave the minimum GUC value at 100
still, so you have an easy way to test with 0, 1, ... additional
buffers?
2. It shouldn't give up issuing random advice immediately after a
jump, or it could stall on (say) the second 128kB of a 256kB
sequential chunk (ie the strace you showed on the BHS thread). It
only makes sense to assume kernel readahead takes over once you've
actually *read* sequentially. In practice this makes it a lot more
aggressive about advice (like the BHS code in master): it only gives
up if the whole look-ahead window is sequential.
3. Change the distance algorithm to care only about hits and misses,
not sequential heuristics. It made at least some sense before, but it
doesn't make sense for AIO, and even in synchronous mode it means that
you hit random jumps with insufficient look-ahead, so I don't think we
should keep it.
I also realised that the sequential heuristics are confused by that
hidden trailing block thing, so in contrived pattern testing with
hit-miss-hit-miss... would be considered sequential, and even if you
fix that (the forwarding patches above fix that), an exact
hit-miss-hit-miss pattern also gets stuck between distances 1 and 2
(double, decrement, double, ...). It might be worth waiting a bit
longer before decrementing, IDK.
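The oscillation can be seen with a tiny simulation of the double-on-miss, decrement-on-hit heuristic (a simplification of the real distance logic in read_stream.c):

```c
#include <assert.h>

/*
 * Simplified model of the look-ahead distance heuristic: a miss ('m')
 * doubles the distance (capped), a hit ('h') decrements it (floored
 * at 1).  An exact hit-miss alternation never escapes 1 <-> 2.
 */
static int
simulate_distance(const char *pattern, int max_distance)
{
	int		distance = 1;

	for (const char *p = pattern; *p; p++)
	{
		if (*p == 'm')
		{
			distance *= 2;
			if (distance > max_distance)
				distance = max_distance;
		}
		else if (distance > 1)
			distance--;
	}
	return distance;
}
```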
I'll rebase the others and post soon.
set io_combine_limit = 32;
set temp_buffers = 100;
create temp table t1 as select generate_series(1, 10000);
create temp table t2 as select generate_series(1, 10000);
create temp table t3 as select generate_series(1, 10000);
create temp table t4 as select generate_series(1, 10000);
create temp table t5 as select generate_series(1, 10000);
do
$$
declare
c1 cursor for select * from t1;
c2 cursor for select * from t2;
c3 cursor for select * from t3;
c4 cursor for select * from t4;
c5 cursor for select * from t5;
x record;
begin
open c1;
open c2;
open c3;
open c4;
open c5;
loop
fetch next from c1 into x;
exit when not found;
fetch next from c2 into x;
exit when not found;
fetch next from c3 into x;
exit when not found;
fetch next from c4 into x;
exit when not found;
fetch next from c5 into x;
exit when not found;
end loop;
end;
$$;
Attachments:
v2-0001-Improve-buffer-manager-API-for-backend-pin-limits.patch (text/x-patch, +78 -26)
v2-0002-Respect-pin-limits-accurately-in-read_stream.c.patch (text/x-patch, +94 -15)
v2-0003-Improve-read-stream-advice-for-larger-random-read.patch (text/x-patch, +36 -10)
v2-0004-Look-ahead-more-when-sequential-in-read_stream.c.patch (text/x-patch, +33 -60)
Hi,
On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
On Wed, Feb 26, 2025 at 10:55 PM Andres Freund <andres@anarazel.de> wrote:
I was working on expanding tests for AIO and as part of that wrote a test for
temp tables -- our coverage is fairly awful, there were many times during AIO
development where I knew I had trivially reachable temp table specific bugs
but all tests passed.
The test for that does trigger the problem described above and is fixed by the
patches in this thread (which I included in the other thread):
Here is a subset of those patches again:
1. Per-backend buffer limit, take III. Now the check is in
read_stream_start_pending_read() so TOC == TOU.
Annoyingly, test cases like the one below still fail, despite
following the rules. The other streams eat all the buffers and then
one gets an allowance of zero, but uses its right to take one pin
anyway to make progress, and there isn't one.
I think that may be ok. If there are no unpinned buffers, it seems to be
expected that starting a new stream will fail. That's not the same as - as we
did previously - failing in a read stream that did start successfully, because
we issue large IOs even though there are only a small number of unpinned
buffers.
I wonder if we should use temp_buffers - 100? Then leave the minimum GUC
value at 100 still, so you have an easy way to test with 0, 1,
... additional buffers?
I think that just makes it harder to test the exhaustion scenario without
really fixing anything?
2. It shouldn't give up issuing random advice immediately after a
jump, or it could stall on (say) the second 128kB of a 256kB
sequential chunk (ie the strace you showed on the BHS thread). It
only makes sense to assume kernel readahead takes over once you've
actually *read* sequentially. In practice this makes it a lot more
aggressive about advice (like the BHS code in master): it only gives
up if the whole look-ahead window is sequential.
3. Change the distance algorithm to care only about hits and misses,
not sequential heuristics. It made at least some sense before, but it
doesn't make sense for AIO, and even in synchronous mode it means that
you hit random jumps with insufficient look-ahead, so I don't think we
should keep it.
I also realised that the sequential heuristics are confused by that
hidden trailing block thing, so in contrived pattern testing with
hit-miss-hit-miss... would be considered sequential, and even if you
fix that (the forwarding patches above fix that), an exact
hit-miss-hit-miss pattern also gets stuck between distances 1 and 2
(double, decrement, double, ...). It might be worth waiting a bit longer
before decrementing, IDK.
I'll rebase the others and post soon.
+
+/* see GetAdditionalPinLimit() */
+uint32
+GetAdditionalLocalPinLimit(void)
+{
+	Assert(NLocalPinnedBuffers <= num_temp_buffers);
+	return num_temp_buffers - NLocalPinnedBuffers;
+}
This doesn't behave quite the way GetAdditionalPinLimit() does - it can return
0. Which makes some sense, pinning an additional buffer will always
fail. Perhaps worth calling out though?
 static void
 read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 {
 	while (stream->ios_in_progress < stream->max_ios &&
-		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+		   ((stream->pinned_buffers == 0 && stream->distance > 0) ||
+			stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))
What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
mean? And when would it be true when the pre-existing condition wouldn't
already be true?
 	{
 		BlockNumber blocknum;
 		int16		buffer_index;
 		void	   *per_buffer_data;

-		if (stream->pending_read_nblocks == io_combine_limit)
+		/* If have a pending read that can't be extended, start it now. */
+		Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+			   stream->max_pinned_buffers);
+		if (stream->pending_read_nblocks == io_combine_limit ||
+			(stream->pinned_buffers == 0 &&
+			 stream->pending_read_nblocks == stream->max_pinned_buffers))
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		/* We have to start the pending read before we can build another. */
 		while (stream->pending_read_nblocks > 0)
 		{
-			read_stream_start_pending_read(stream, suppress_advice);
-			suppress_advice = false;
-			if (stream->ios_in_progress == stream->max_ios)
+			if (!read_stream_start_pending_read(stream, suppress_advice) ||
+				stream->ios_in_progress == stream->max_ios)
 			{
-				/* And we've hit the limit. Rewind, and stop here. */
+				/* And we've hit a buffer or I/O limit. Rewind and wait. */
 				read_stream_unget_block(stream, blocknum);
 				return;
 			}
+
+			suppress_advice = false;
 		}
If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()? IIUC that'll
defer one block for later - but what about the other buffers in a multi-block
read?
@@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
else
 		Assert(stream->next_buffer_index == stream->oldest_buffer_index);

-	/*
-	 * If advice hasn't been suppressed, this system supports it, and this
-	 * isn't a strictly sequential pattern, then we'll issue advice.
-	 */
-	if (!suppress_advice &&
-		stream->advice_enabled &&
-		stream->pending_read_blocknum != stream->seq_blocknum)
+	/* Do we need to issue read-ahead advice? */
+	flags = 0;
+	if (stream->advice_enabled)
+	{
 		flags = READ_BUFFERS_ISSUE_ADVICE;
-	else
-		flags = 0;
+
+		if (stream->pending_read_blocknum == stream->seq_blocknum)
+		{
+			/*
+			 * Suppress advice if our WaitReadBuffers() calls have caught up
+			 * with the first advice we issued for this sequential run.
+			 */
+			if (stream->seq_start == InvalidBlockNumber)
+				suppress_advice = true;
+		}
+		else
+		{
+			/* Random jump, so start a new sequential run. */
+			stream->seq_start = stream->pending_read_blocknum;
+		}
+
+		if (suppress_advice)
+			flags = 0;
+	}
Seems a bit confusing to first set
flags = READ_BUFFERS_ISSUE_ADVICE
to then later unset it again. Maybe just set it in if (!suppress_advice)?
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			distance = stream->distance * 2;
 			distance = Min(distance, stream->max_pinned_buffers);
 			stream->distance = distance;
+
+			/*
+			 * If we've caught up with the first advice issued for the current
+			 * sequential run, cancel further advice until the next random
+			 * jump.  The kernel should be able to see the pattern now that
+			 * we're issuing sequential preadv() calls.
+			 */
+			if (stream->ios[io_index].op.blocknum == stream->seq_start)
+				stream->seq_start = InvalidBlockNumber;
So stream->seq_start doesn't really denote the start of sequentialness, it
denotes up to where the caller needs to process before we disable sequential
access. Maybe add a comment to it and rename it to something like
->seq_until_processed?
Other than this the approach seems to make sense!
Greetings,
Andres Freund
On Wed, Mar 12, 2025 at 8:29 AM Andres Freund <andres@anarazel.de> wrote:
On 2025-03-12 07:35:46 +1300, Thomas Munro wrote:
On Thu, Feb 27, 2025 at 11:20 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-02-27 11:19:55 +1300, Thomas Munro wrote:
I wonder if we should use temp_buffers - 100? Then leave the minimum GUC
value at 100 still, so you have an easy way to test with 0, 1,
... additional buffers?
I think that just makes it harder to test the exhaustion scenario without
really fixing anything?
Not sure I agree yet but I'll come back to this in a bit (I think
there might be something worth thinking about some more here but it's
not in the way of committing these patches).
+/* see GetAdditionalPinLimit() */
+uint32
+GetAdditionalLocalPinLimit(void)
+{
+	Assert(NLocalPinnedBuffers <= num_temp_buffers);
+	return num_temp_buffers - NLocalPinnedBuffers;
+}
This doesn't behave quite the way GetAdditionalPinLimit() does - it can return
0. Which makes some sense, pinning an additional buffer will always
fail. Perhaps worth calling out though?
No, GetAdditionalPinLimit() works that way too. It's only
LimitAdditional[Local]PinLimit() that has the special "you can always
have one" logic that I needed to escape from. But yes I should
highlight that in a comment: done above GetAdditionalPinLimit().
GetAdditionalLocalPinLimit() just tells you to see the shared version.
 static void
 read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 {
 	while (stream->ios_in_progress < stream->max_ios &&
-		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+		   ((stream->pinned_buffers == 0 && stream->distance > 0) ||
+			stream->pinned_buffers + stream->pending_read_nblocks < stream->distance))
What does the new "stream->pinned_buffers == 0 && stream->distance > 0" really
mean? And when would it be true when the pre-existing condition wouldn't
already be true?
Well the reason is basically that the distance can get chomped lower
than pending_read_nblocks if you recently hit the pin limit, and we
have to be able to start your I/O anyway (at least one block of it) to
make progress. But I realised that was a stupid way to handle that,
and actually I had screwed up further down, and the right way is just:
@@ -382,15 +435,25 @@ read_stream_look_ahead(ReadStream *stream, bool
suppress_advice)
* io_combine_limit size once more buffers have been consumed. However,
* if we've already reached io_combine_limit, or we've reached the
* distance limit and there isn't anything pinned yet, or the
callback has
- * signaled end-of-stream, we start the read immediately.
+ * signaled end-of-stream, we start the read immediately. Note that the
+ * pending read could even exceed the distance goal, if the latter was
+ * reduced on buffer limit exhaustion.
*/
if (stream->pending_read_nblocks > 0 &&
(stream->pending_read_nblocks == stream->io_combine_limit ||
- (stream->pending_read_nblocks == stream->distance &&
+ (stream->pending_read_nblocks >= stream->distance &&
stream->pinned_buffers == 0) ||
stream->distance == 0) &&
stream->ios_in_progress < stream->max_ios)
read_stream_start_pending_read(stream, suppress_advice);
 	{
 		BlockNumber blocknum;
 		int16		buffer_index;
 		void	   *per_buffer_data;

-		if (stream->pending_read_nblocks == io_combine_limit)
+		/* If have a pending read that can't be extended, start it now. */
+		Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+			   stream->max_pinned_buffers);
+		if (stream->pending_read_nblocks == io_combine_limit ||
+			(stream->pinned_buffers == 0 &&
+			 stream->pending_read_nblocks == stream->max_pinned_buffers))
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -360,14 +409,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		/* We have to start the pending read before we can build another. */
 		while (stream->pending_read_nblocks > 0)
 		{
-			read_stream_start_pending_read(stream, suppress_advice);
-			suppress_advice = false;
-			if (stream->ios_in_progress == stream->max_ios)
+			if (!read_stream_start_pending_read(stream, suppress_advice) ||
+				stream->ios_in_progress == stream->max_ios)
 			{
-				/* And we've hit the limit. Rewind, and stop here. */
+				/* And we've hit a buffer or I/O limit. Rewind and wait. */
 				read_stream_unget_block(stream, blocknum);
 				return;
 			}
+
+			suppress_advice = false;
 		}
If read_stream_start_pending_read() returns false because we hit the pin
limit, does it really help to call read_stream_unget_block()? IIUC that'll
defer one block for later - but what about the other buffers in a multi-block
read?
They are already represented by (pending_read_blocknum,
pending_read_nblocks). We were unable to append this block number to
the pending read because it's already full-sized or the newly acquired
block number isn't consecutive, but we were also unable to start the
pending read to get it out of the way. That was a pre-existing edge
case that you could already hit if it turned out to be a short read:
ie the remaining part of the pending read is still in your way, and
now you've reached stream->max_ios so you can't start it. So we had
to put it aside for later.
This change piggy-backs on that approach for buffer starvation:
read_stream_start_buffers() can now decline to start even a partial
read. In fact it usually declines, unless it is forced to accept a
short read because stream->pinned_buffers == 0 (ie, we have to do
whatever we can to make progress).
It's OK that pending_read_nblocks exceeds what we can start right now,
we still remember it, and we can always start it one block at a time
if it comes to it. Make sense?
@@ -260,16 +261,30 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
else
 		Assert(stream->next_buffer_index == stream->oldest_buffer_index);

-	/*
-	 * If advice hasn't been suppressed, this system supports it, and this
-	 * isn't a strictly sequential pattern, then we'll issue advice.
-	 */
-	if (!suppress_advice &&
-		stream->advice_enabled &&
-		stream->pending_read_blocknum != stream->seq_blocknum)
+	/* Do we need to issue read-ahead advice? */
+	flags = 0;
+	if (stream->advice_enabled)
+	{
 		flags = READ_BUFFERS_ISSUE_ADVICE;
-	else
-		flags = 0;
+
+		if (stream->pending_read_blocknum == stream->seq_blocknum)
+		{
+			/*
+			 * Suppress advice if our WaitReadBuffers() calls have caught up
+			 * with the first advice we issued for this sequential run.
+			 */
+			if (stream->seq_start == InvalidBlockNumber)
+				suppress_advice = true;
+		}
+		else
+		{
+			/* Random jump, so start a new sequential run. */
+			stream->seq_start = stream->pending_read_blocknum;
+		}
+
+		if (suppress_advice)
+			flags = 0;
+	}
Seems a bit confusing to first set
flags = READ_BUFFERS_ISSUE_ADVICE
to then later unset it again. Maybe just set it in if (!suppress_advice)?
Yeah that was a bit too tangly. I found a better expression of that
logic, which also removed that annoying suppress_advice function
argument. I hope this is much clearer.
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -825,6 +842,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			distance = stream->distance * 2;
 			distance = Min(distance, stream->max_pinned_buffers);
 			stream->distance = distance;
+
+			/*
+			 * If we've caught up with the first advice issued for the current
+			 * sequential run, cancel further advice until the next random
+			 * jump.  The kernel should be able to see the pattern now that
+			 * we're issuing sequential preadv() calls.
+			 */
+			if (stream->ios[io_index].op.blocknum == stream->seq_start)
+				stream->seq_start = InvalidBlockNumber;
So stream->seq_start doesn't really denote the start of sequentialness, it
denotes up to where the caller needs to process before we disable sequential
access. Maybe add a comment to it and rename it to something like
->seq_until_processed?
WFM.
Other than this the approach seems to make sense!
Cool, so I'm planning to start pushing the earlier ones tomorrow.
Here also are the buffer forwarding ones, rebased on top.
Here's some strace art showing the old and new advice for patch 0003.
I traced ANALYZE io_combine_limit=8 and used different
default_statistics_targets to find interesting test cases. The
"bracket" on the right shows a sequential range of blocks. Master
only calls fadvise once per sequential chunk:
fadvise ●──────────────────────╮││ 3 0.000006
││╰─► pread 1 676..676 2 0.000007
fadvise ●─────────────────────╮││ 3 0.000006
││╰──► pread 1 678..678 2 0.000007
fadvise ●────────────────────╮││ 3 0.000007
││╰───► pread 3 680..682 2 0.000031
│╰────► pread 6 684..689 1 0.000015
╰─────► pread 8 691..698 ─╮ 0 0.000018
pread 8 699..706 │ 0 0.000016
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 707..714 │ 1 0.000019
│ pread 7 715..721 ─╯ 1 0.000017
╰─► pread 8 723..730 ─╮ 0 0.000016
pread 8 731..738 │ 0 0.000019
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 739..746 │ 1 0.000018
│ pread 5 747..751 ─╯ 1 0.000013
The patch can call three times when that's the configured I/O
concurrency level, because that controls when the pread() calls catch
up with the first block:
fadvise ●────────────────────╮││ 3 0.000007
││╰───► pread 2 255..256 2 0.000014
fadvise ●───────────────────╮││ 3 0.000007
││╰────► pread 8 258..265 ─╮ 2 0.000035
│╰─────► preadv 8 266..273 │ 1 0.000021
╰──────► pread 8 274..281 │ 0 0.000017
fadvise ●────────────────────────╮ │ 1 0.000007
│ pread 8 282..289 │ 1 0.000017
fadvise ●───────────────────────╮│ │ 2 0.000007
││ pread 6 290..295 ─╯ 2 0.000015
fadvise ●──────────────────────╮││ 3 0.000007
││╰─► pread 8 297..304 ─╮ 2 0.000015
fadvise ●─────────────────────╮││ │ 3 0.000007
││╰──► pread 8 305..312 │ 2 0.000017
Purely sequential streams still get none:
pread 1 0..0 ─╮ 0 0.000016
pread 2 1..2 │ 0 0.000014
pread 4 3..6 │ 0 0.000021
pread 8 7..14 │ 0 0.000034
... blah blah blah ...
pread 8 4455..4462 │ 0 0.000029
pread 8 4463..4470 │ 0 0.000026
pread 8 4471..4478 │ 0 0.000020
pread 1 4479..4479 ─╯ 0 0.000010
Attachments:
v3-0001-Improve-buffer-manager-API-for-backend-pin-limits.patch (+80 -26)
v3-0002-Respect-pin-limits-accurately-in-read_stream.c.patch (+90 -15)
v3-0003-Improve-read-stream-advice-for-large-random-chunk.patch (+50 -22)
v3-0004-Look-ahead-more-when-sequential-in-read_stream.c.patch (+33 -60)
v3-0005-Support-buffer-forwarding-in-read_stream.c.patch (+92 -11)
v3-0006-Support-buffer-forwarding-in-StartReadBuffers.patch (+91 -39)
I have pushed the new pin limit patches, after some more testing and
copy editing. I dropped an unnecessary hunk (in read_stream_reset(), a
change I'd like to make but it didn't belong in this commit) and
dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
comprehension and isn't even true for the local buffer version (which
I still think might be an issue, will come back to that, but again it
seemed independent).
For the record, here's a way to see 92fc6856^ or v17 misbehave and pin
too many buffers without involving any other proposed patches, only
v17 streaming seq scan: with shared_buffers=16MB, max_connections=4,
which gives MaxProportionalBuffers == 44, the attached shows three
cursors each pinning io_combine_limit = 32 buffers for a total of 96
at peak. That's because each cursor starts a stream and sees (the
only time it would check) that it is allowed 44, and then we fetch in
round-robin order until they all ramp up to full I/O size. In v17 we
can't really do much worse than that, and AFAIK it requires pretty
contrived settings and a contrived workload to become a real issue, but
we will soon have hungrier stream users, e.g. BHS and more.
Hi,
On 2025-03-14 22:03:15 +1300, Thomas Munro wrote:
I have pushed the new pin limit patches, after some more testing and
copy editing. I dropped an unnecessary hunk (in read_stream_reset(), a
change I'd like to make but it didn't belong in this commit) and
dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
comprehension and isn't even true for the local buffer version (which
I still think might be an issue, will come back to that, but again it
seemed independent).
Something odd:
I was looking to push a test that increases localbuf.c coverage, which passed
with a previous version of these changes. However, it failed, and it did so
on FreeBSD alone. The same test doesn't fail when tested together with the
rest of the AIO work.
https://cirrus-ci.com/build/5951090857869312
https://cirrus-ci.com/task/6177637229395968
I do not understand what could be OS dependent here. I tried to build with
exactly the same options as CI on freebsd, but couldn't repro the issue.
It's perhaps worth noting that this failed even before my recent
localbuf commits.
I ran CI with a patch that adds a bunch of debug information and just runs the
relevant test:
https://cirrus-ci.com/task/6516935619248128
2025-03-17 16:19:14.270 UTC client backend[5526] pg_regress/temp LOG: statement: SELECT count(*), max(a) max_a, min(a) min_a, max(cnt) max_cnt FROM test_temp;
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG: NlocalPinnedBuffers=98++
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG: NlocalPinnedBuffers=99--
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG: pgsr create 0xde34f1f57f8: test_temp, max_pinned: 100, NLocPin: 98
2025-03-17 16:19:14.271 UTC client backend[5526] pg_regress/temp DEBUG: pgsr 0xde34f1f57f8: buffer_limit: 2, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 98
comparing that with a local run:
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp LOG: statement: SELECT count(*), max(a) max_a, min(a) min_a, max(cnt) max_cnt FROM test_temp;
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG: pgsr create 0x56029555cad8: test_temp, max_pinned: 100, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG: pgsr 0x56029555cad8: buffer_limit: 3, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG: pgsr 0x56029555cad8: StartReadBuffers: wait: 0, pinned: 0 += 1, distance: 1
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG: pgsr 0x56029555cad8: buffer_limit: 3, pinned_buffers: 0, max_pinned: 100, ios_i_p: 0, distance: 1, pending_read_nblocks: 1, NLocPin: 97
2025-03-17 12:18:55.989 EDT client backend[4117093] pg_regress/temp DEBUG: pgsr 0x56029555cad8: StartReadBuffers: wait: 0, pinned: 0 += 1, distance: 1
So one thing is that the pin count differs by 1 at the start of the scan. No
idea why.
But knowing that I was able to repro the issue locally too, by just ensuring
the pin count is one higher by the start of the query.
I still don't know what drives the difference between freebsd and the rest,
but IIUC the reason this fails is just that we are holding too many buffers
pinned, due to some buffers being pinned outside of read_stream.c.
Greetings,
Andres Freund
On Tue, Mar 18, 2025 at 5:56 AM Andres Freund <andres@anarazel.de> wrote:
So one thing is that the pin count differs by 1 at the start of the scan. No
idea why.
I still don't know what drives the difference between freebsd and the rest,
but IIUC the reason this fails is just that we are holding too many buffers
pinned, due to some buffers being pinned outside of read_stream.c.
I couldn't reproduce this on my local FreeBSD box, but I think I see
one part of the problem: the cursor a few lines up has a stream with a
higher distance holding a couple of pins. Not sure how the local
buffer pool got into a state that caused that if it isn't doing the
same on other machines, but anyway, if I read the test right you
intend to pin strictly one page per cursor, so I tried saying so
explicitly:
- query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM
test_temp WHERE ctid >= '( %s, 1)'::tid $q$, cursorname, i);
+ query = format($q$DECLARE %I CURSOR FOR SELECT ctid FROM
test_temp WHERE ctid between '(%s, 1)'::tid and '(%s, 9999)'::tid $q$,
cursorname, i, i);
That passed on the FreeBSD CI task.
Hi,
On 2025-03-14 22:03:15 +1300, Thomas Munro wrote:
I have pushed the new pin limit patches, after some more testing and
copy editing. I dropped an unnecessary hunk (in read_stream_reset(), a
change I'd like to make but it didn't belong in this commit) and
dropped the word "Soft" from GetSoftPinLimit() as it wasn't helping
comprehension and isn't even true for the local buffer version (which
I still think might be an issue, will come back to that, but again it
seemed independent).
I found an, easy to fix, issue in read_stream.c. If the limit returned by
GetAdditionalPinLimit() is very large, the buffer_limit variable in
read_stream_start_pending_read() can overflow.
While the code is careful to limit buffer_limit to PG_INT16_MAX, we
subsequently add the number of forwarded buffers:
	if (stream->temporary)
		buffer_limit = Min(GetAdditionalLocalPinLimit(), PG_INT16_MAX);
	else
		buffer_limit = Min(GetAdditionalPinLimit(), PG_INT16_MAX);

	Assert(stream->forwarded_buffers <= stream->pending_read_nblocks);
	buffer_limit += stream->forwarded_buffers;
I think there's a second, theoretical, overflow issue shortly after:
/* Shrink distance: no more look-ahead until buffers are released. */
new_distance = stream->pinned_buffers + buffer_limit;
if (stream->distance > new_distance)
stream->distance = new_distance;
while that was hit in the case I was looking at, it was only hit because
buffer_limit had already wrapped around into the negative. I don't think
nblocks can be big enough to really hit this.
I don't actually see any reason for buffer_limit to be a 16bit quantity? It's
just to clamp things down, right?
Greetings,
Andres Freund
On Thu, Apr 3, 2025 at 11:17 AM Andres Freund <andres@anarazel.de> wrote:
I don't actually see any reason for buffer_limit to be a 16bit quantity? It's
just to clamp things down, right?
Ugh. It might be worth just flipping this whole thing over to ints,
let me look into that...
Hi,
On 2025-04-03 14:43:40 +1300, Thomas Munro wrote:
On Thu, Apr 3, 2025 at 11:17 AM Andres Freund <andres@anarazel.de> wrote:
I don't actually see any reason for buffer_limit to be a 16bit quantity? It's
just to clamp things down, right?
Ugh. It might be worth just flipping this whole thing over to ints,
let me look into that...
If you don't mind I'll push the obvious minimal fix soon - it's a confusing
crash when one encounters this issue...
Greetings,
Andres Freund
On Mon, Apr 7, 2025 at 1:29 PM Andres Freund <andres@anarazel.de> wrote:
On 2025-04-03 14:43:40 +1300, Thomas Munro wrote:
On Thu, Apr 3, 2025 at 11:17 AM Andres Freund <andres@anarazel.de> wrote:
I don't actually see any reason for buffer_limit to be a 16bit quantity? It's
just to clamp things down, right?
Ugh. It might be worth just flipping this whole thing over to ints,
let me look into that...
If you don't mind I'll push the obvious minimal fix soon - it's a confusing
crash when one encounters this issue...
Please go ahead, sorry I got distracted trying to fix something else...