BAS_BULKREAD vs read stream

Started by Andres Freund, 9 months ago, 7 messages
#1 Andres Freund
andres@anarazel.de

Hi,

There are two issues with BAS_BULKREAD interactions with read stream. One for
17+ and one for 18+.

The 17+ issue:

While trying to analyze issues with BAS_BULKREAD's size vs AIO I was getting
really confusing results. After an embarrassingly long time I figured out that
a good chunk of those were due to BAS_BULKREAD not actually limiting buffer
usage - we'd end up filling shared buffers despite using the strategy.

There are two reasons for that:

- read_stream.c doesn't account for the buffer it is returning from
read_stream_next_buffer()

If we subtract 1 from GetAccessStrategyPinLimit() this avenue for escaping
the strategy is closed. The amount of "leakage" reduces very substantially.

- read_stream.c doesn't know about buffer pins held externally. For sequential
scans the buffer pin held by the slot passed to heap_getnextslot() will
occasionally cause a buffer to be still pinned when the strategy wants to
reuse the buffer, preventing reuse.

This causes a slower "leak", but it's still rather substantial.

Obviously there can be plans that pin buffers for longer, but that's
relatively rare and nothing new. Whereas leaking buffers even with a plain
seqscan is new in 17.

This issue exists in 17, but is harder to hit, because we don't actually read
ahead with sequential scans. If we just do a single read at a time, we don't
pin more than io_combine_limit buffers. The default io_combine_limit is 16,
whereas BAS_BULKREAD is 32 buffers with the default BLCKSZ=8192. As soon as
one uses io_combine_limit=32 the issue reproduces on 17.
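
To spell out the arithmetic with the defaults mentioned above (BLCKSZ=8192,
i.e. 8kB buffers):

  BAS_BULKREAD ring:      256kB / 8kB = 32 buffers
  io_combine_limit = 16:  a single combined read pins up to 16 buffers (half the ring)
  io_combine_limit = 32:  a single combined read pins up to 32 buffers (the whole
                          ring), leaving no slack for the buffer returned to the
                          caller or for externally held pins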

With AIO this obviously becomes easier to hit.

I think medium term we need to overhaul how strategies work rather
substantially. The whole concept isn't quite right [1].

But for now we need a solution that's doable for both 17 and 18. While not
pretty, I think the best short term solution is to just subtract 2 from the
strategy limit. That seems to work for all common uses of BAS_BULKREAD
currently.

I suspect that we should subtract one in read_stream.c (to account for the
buffer returned by read_stream_next_buffer()) and one in
GetAccessStrategyPinLimit() (to account for pins held externally).
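
To illustrate the idea, here is a minimal sketch of that "subtract 2" approach.
This is not the actual fix, and the surrounding code is paraphrased from
memory, so treat names and structure as approximate:

```c
/*
 * 1) In GetAccessStrategyPinLimit() (freelist.c), reserve one buffer for a
 *    pin held outside the read stream, e.g. the pin kept by the slot passed
 *    to heap_getnextslot():
 */
case BAS_BULKREAD:
	return Max(strategy->nbuffers - 1, 1);

/*
 * 2) When sizing the stream in read_stream.c, additionally reserve the
 *    buffer that read_stream_next_buffer() hands back to the caller:
 */
strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
strategy_pin_limit = Max(strategy_pin_limit - 1, 1);
max_pinned_buffers = Min(max_pinned_buffers, strategy_pin_limit);
```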

The 18 issue:

Right now there's a noticeable performance drop when going from a seqscan that
doesn't use BAS_BULKREAD to one that uses BAS_BULKREAD. This is very
pronounced when not hitting the OS page cache, but noticeable even otherwise.

The reason for that slowdown is that BAS_BULKREAD is so small that it just
allows 1-2 IOs in flight (io_combine_limit = 16, i.e. 128kB, vs BAS_BULKREAD's
256kB). Whereas, with a smaller table that doesn't trigger BAS_BULKREAD, we
can have 16 concurrent IOs with the default settings, assuming a large enough
shared_buffers (on very small s_b we'll limit how many buffers we pin).
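
In other words, with the defaults:

  max IOs in flight within the ring ~= ring size / max combined IO size
                                     = 256kB / (16 * 8kB)
                                     = 2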

The obvious solution to that would be to increase BAS_BULKREAD substantially
above 256kB.

For quite a while I was worried about increasing the size, because somewhere (I
couldn't find it while writing this email, will add the reference once I've
found it again) we have a comment explaining that a small size was chosen
because it helps with CPU cache efficiency.

Initially I could indeed see reasonably sized effects, but a good chunk of
that turned out to be the issue above, where small sizes simply wouldn't
actually use the ring and thus have completely different performance
characteristics.

I *can* easily reproduce effects of doing very large reads when the data is in
the page cache, which indeed seems to be related to cache effects, apparently
due to some CPU oddities around crossing kernel/user memory, see [3].

I also can reproduce some negative effects of using larger ring sizes for
io_method=sync. A 10GB pg_prewarm(), which I extended with the ability to use
a strategy of a certain size, slows down ~22% when going from 8MB to 16MB.
Obviously io_method=sync does not benefit from a larger ring size, as it just
executes IOs synchronously, which is not limited by the ring size.

The slowdown for io_method=worker for very small ring sizes is substantial,
whether the data is in the page cache or not. It simply limits the
asynchronicity too much. At 256kB io_method=worker is ~50% slower than
io_method=sync, at 512kB it's 20% faster, and at 1MB io_method=worker is 2.5x
faster.

With io_uring there is a 7% regression at 128kB (lower than the current
default), at 256kB io_uring is 5% faster, reaching 1.9x at 3MB.

I think we should consider increasing BAS_BULKREAD to something like
Min(256, io_combine_limit * (effective_io_concurrency + 1))

As I said earlier, I think we need to redesign strategies, but this seems to
address the regression when going from no-strategy to strategy, without
causing any meaningful regressions.

I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
couldn't come up with any benefits. It seems to hurt fairly quickly.

Greetings,

Andres Freund

[1]: The whole idea of reusing buffers after a set distance is close to
nonsensical for BAS_BULKREAD - we should reuse them as soon as viable (i.e. not
pinned anymore), for cache efficiency. The only reason deferring the reuse
makes *some* sense is [2].

It doesn't even really make sense for the other BAS_'s. What we should control
there is the amount of dirty buffers and how to flush WAL at a sensible
rate. But with BAS_VACUUM that doesn't work at all, as it's used for both
reads and writes. If vacuum does modify buffers, we can end up flushing WAL
way too frequently and if we only rarely modify, the amount of buffers in the
ring is too big.

[2]: For synchronize_seqscans to work, SYNC_SCAN_REPORT_INTERVAL needs to be
smaller than the BAS_BULKREAD size. The factor is 2x right now - way way too
small with IO rates achievable on even a cheap laptop. Cheap storage can do
reads at gigabytes/second and one backend can process many hundreds of MB
each second. Having only 128kB between SYNC_SCAN_REPORT_INTERVAL and the
BAS_BULKREAD makes it very likely that we have to re-read the buffer from the
OS. That's bad with buffered IO, catastrophic with direct IO.
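
To put a number on that slack: at, say, 500MB/s of per-backend processing (an
assumed figure, well within "many hundreds of MB each second"), 128kB of
headroom lasts roughly 0.25ms before the leading scan recycles those buffers -
almost no time for a second scan to join and still get buffer hits.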

On local NVMe SSDs on a large table with the normal BAS_BULKREAD I often get
effectively *no* buffer hits when running two concurrent seqscans, even if I
disable query parallelism.

Sometimes the hits are in the 10s, sometimes in the low thousands, always < 0.1%. At 16MB
I get 25% hits, at 128MB it's close to 50% (the max you could get).

With parallelism I see very low hit rates even with 16MB; only with 256MB does
it start to get better.

[3]: /messages/by-id/yhklc3wuxt4l42tpah37rzsxoycresoiae22h2eluotrwr37gq@3r54w5zqldwn

#2 Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#1)
Re: BAS_BULKREAD vs read stream

On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:

I think we should consider increasing BAS_BULKREAD to something like
Min(256, io_combine_limit * (effective_io_concurrency + 1))

Do you mean Max? If so, this basically makes sense to me.
Overall, I think even though the ring is about reusing buffers, we
have to think about how many IOs that reasonably is -- which this
formula does.

You mentioned testing with 8MB, did you see some sort of cliff
anywhere between 256kB and 8MB?

I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
couldn't come up with any benefits. It seems to hurt fairly quickly.

So, how will you deal with it when the BAS_BULKREAD ring is bigger?

- Melanie

#3 Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#2)
Re: BAS_BULKREAD vs read stream

Hi,

On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:

On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:

I think we should consider increasing BAS_BULKREAD to something like
Min(256, io_combine_limit * (effective_io_concurrency + 1))

Do you mean Max? If so, this basically makes sense to me.

Err, yes.

I was wondering whether we should add a Max(SYNC_SCAN_REPORT_INTERVAL, ...),
but it's a private value, and the proposed formula doesn't really change
anything for SYNC_SCAN_REPORT_INTERVAL. So I think it's fine.

Overall, I think even though the ring is about reusing buffers, we
have to think about how many IOs that reasonably is -- which this
formula does.

Right - the prior limit kinda made sense before we had IO combining, but
after that *and* having AIO it is clearly obsolete.

You mentioned testing with 8MB, did you see some sort of cliff anywhere
between 256kB and 8MB?

There's not really a single cliff.

For buffered, fully cached IO:

With io_method=sync, it gets way better between 64 and 128kB, then gets worse
between 128kB and 256kB (the current value), and then seems to gradually get
worse starting somewhere around 8MB. 32MB is 50% slower than 8MB...

io_method=worker is awful with 64-128kB, not great at 256kB and then is very
good. There's a 10% decline from 16MB->32MB.

io_method=io_uring is similar to sync at 64-128kB, very good from then on. I
do see a 6% decline from 16MB->32MB.

I suspect the 16-32MB cliff is due to L3 related effects - the L3 cache is
13.8MB per socket (of which I have 2). It's not entirely clear what that
effect is - all the additional cycles are spent in the kernel, not in
userspace. I strongly suspect it's related to SMAP [1], but I don't really
understand the details. All I know is that disabling SMAP removes this cliff
on several Intel and AMD systems, both client and server CPUs.

For buffered, non-cached IO:

io_method=sync: I see no performance difference across all ring sizes.

io_method=worker: Performance is ~12% worse than sync at <= 256kB, 1.36x
faster at 512kB, 2.07x at 1MB, 3.0x at 4MB, and then it stays the same
up to 64MB.

io_method=io_uring: equivalent to sync at <= 256kB, 1.54x faster at 512kB,
3.2x faster at 4MB and stays the same up to 64MB.

For DIO/unbuffered IO:

As io_method=sync, obviously, doesn't do DIO/unbuffered IO in a reasonable
way, it doesn't make sense to compare it. So I'm comparing to buffered IO.

io_method=worker: Performance is terrifyingly bad at 128kB (like 0.41x the
throughput of buffered IO), slightly worse than buffered at 256kB. Best perf
is reached at 4MB and stays very consistent after that.

io_method=io_uring: Performance is terrifyingly bad at <= 256kB (like 0.43x the
throughput of buffered IO) and starts to be decent after that. Best perf is
reached at 4MB and stays very consistent after that.

The peak perf of buffered but uncached IO and DIO is rather close, as
I'm testing this on a PCIe3 drive.

The difference in CPU cycles is massive though:

worker buffered:

9,850.27 msec cpu-clock # 3.001 CPUs utilized
305,050 context-switches # 30.969 K/sec
51,049 cpu-migrations # 5.182 K/sec
11,530 page-faults # 1.171 K/sec
16,615,532,455 instructions # 0.84 insn per cycle (30.72%)
19,876,584,840 cycles # 2.018 GHz (30.75%)
3,256,065,951 branches # 330.556 M/sec (30.78%)
26,046,144 branch-misses # 0.80% of all branches (30.81%)
4,452,808,846 L1-dcache-loads # 452.050 M/sec (30.83%)
574,304,216 L1-dcache-load-misses # 12.90% of all L1-dcache accesses (30.82%)
169,117,254 LLC-loads # 17.169 M/sec (30.82%)
82,769,152 LLC-load-misses # 48.94% of all LL-cache accesses (30.82%)
377,137,247 L1-icache-load-misses (30.78%)
4,475,873,620 dTLB-loads # 454.391 M/sec (30.76%)
5,496,266 dTLB-load-misses # 0.12% of all dTLB cache accesses (30.73%)
9,765,507 iTLB-loads # 991.395 K/sec (30.70%)
7,525,173 iTLB-load-misses # 77.06% of all iTLB cache accesses (30.70%)

3.282465335 seconds time elapsed

worker DIO:
9,783.05 msec cpu-clock # 3.000 CPUs utilized
356,102 context-switches # 36.400 K/sec
32,575 cpu-migrations # 3.330 K/sec
1,245 page-faults # 127.261 /sec
8,076,414,780 instructions # 1.00 insn per cycle (30.73%)
8,109,508,194 cycles # 0.829 GHz (30.73%)
1,585,426,781 branches # 162.058 M/sec (30.74%)
17,869,296 branch-misses # 1.13% of all branches (30.78%)
2,199,974,033 L1-dcache-loads # 224.876 M/sec (30.79%)
167,855,899 L1-dcache-load-misses # 7.63% of all L1-dcache accesses (30.79%)
31,303,238 LLC-loads # 3.200 M/sec (30.79%)
2,126,825 LLC-load-misses # 6.79% of all LL-cache accesses (30.79%)
322,505,615 L1-icache-load-misses (30.79%)
2,186,161,593 dTLB-loads # 223.464 M/sec (30.79%)
3,892,051 dTLB-load-misses # 0.18% of all dTLB cache accesses (30.79%)
10,306,643 iTLB-loads # 1.054 M/sec (30.77%)
6,279,217 iTLB-load-misses # 60.92% of all iTLB cache accesses (30.74%)

3.260901966 seconds time elapsed

io_uring buffered:

9,924.48 msec cpu-clock # 2.990 CPUs utilized
340,821 context-switches # 34.341 K/sec
57,048 cpu-migrations # 5.748 K/sec
1,336 page-faults # 134.617 /sec
16,630,629,989 instructions # 0.88 insn per cycle (30.74%)
18,985,579,559 cycles # 1.913 GHz (30.64%)
3,253,081,357 branches # 327.784 M/sec (30.67%)
24,599,858 branch-misses # 0.76% of all branches (30.68%)
4,515,979,721 L1-dcache-loads # 455.035 M/sec (30.69%)
556,041,180 L1-dcache-load-misses # 12.31% of all L1-dcache accesses (30.67%)
160,198,962 LLC-loads # 16.142 M/sec (30.65%)
75,164,349 LLC-load-misses # 46.92% of all LL-cache accesses (30.65%)
348,585,830 L1-icache-load-misses (30.63%)
4,473,414,356 dTLB-loads # 450.746 M/sec (30.91%)
1,193,495 dTLB-load-misses # 0.03% of all dTLB cache accesses (31.04%)
5,507,512 iTLB-loads # 554.942 K/sec (31.02%)
2,973,177 iTLB-load-misses # 53.98% of all iTLB cache accesses (31.02%)

3.319117422 seconds time elapsed

io_uring DIO:

9,782.99 msec cpu-clock # 3.000 CPUs utilized
96,916 context-switches # 9.907 K/sec
8 cpu-migrations # 0.818 /sec
1,001 page-faults # 102.320 /sec
5,902,978,172 instructions # 1.45 insn per cycle (30.73%)
4,059,940,112 cycles # 0.415 GHz (30.73%)
1,117,690,786 branches # 114.248 M/sec (30.75%)
10,994,087 branch-misses # 0.98% of all branches (30.77%)
1,559,149,686 L1-dcache-loads # 159.374 M/sec (30.78%)
85,057,280 L1-dcache-load-misses # 5.46% of all L1-dcache accesses (30.78%)
11,393,236 LLC-loads # 1.165 M/sec (30.78%)
2,599,701 LLC-load-misses # 22.82% of all LL-cache accesses (30.79%)
174,124,990 L1-icache-load-misses (30.80%)
1,545,148,685 dTLB-loads # 157.942 M/sec (30.79%)
156,524 dTLB-load-misses # 0.01% of all dTLB cache accesses (30.79%)
3,325,307 iTLB-loads # 339.907 K/sec (30.77%)
2,288,730 iTLB-load-misses # 68.83% of all iTLB cache accesses (30.74%)

3.260716339 seconds time elapsed

I'd say a 4.5x reduction in cycles is rather nice :)

I experimented some with whether SYNC_SCAN_REPORT_INTERVAL should be increased, and
couldn't come up with any benefits. It seems to hurt fairly quickly.

So, how will you deal with it when the BAS_BULKREAD ring is bigger?

I think I would just leave it at the current value. What I meant with "hurt
fairly quickly" is that *increasing* SYNC_SCAN_REPORT_INTERVAL seems to make
synchronize_seqscans work even less well.

Greetings,

Andres Freund

[1]: https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention

#4 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#3)
1 attachment(s)
Re: BAS_BULKREAD vs read stream

Hi,

On 2025-04-07 16:28:20 -0400, Andres Freund wrote:

On 2025-04-07 15:24:43 -0400, Melanie Plageman wrote:

On Sun, Apr 6, 2025 at 4:15 PM Andres Freund <andres@anarazel.de> wrote:

I think we should consider increasing BAS_BULKREAD to something like
Min(256, io_combine_limit * (effective_io_concurrency + 1))

Do you mean Max? If so, this basically makes sense to me.

Err, yes.

In the attached I implemented the above idea, with some small additional
refinements:

- To allow sync seqscans to work at all, we should only *add* to the 256kB
that we currently have - otherwise all buffers in a ring will be undergoing
IO, never allowing two synchronizing scans to actually use the buffers that
the other scan has already read in.

This also kind of obsoletes the + 1 in the formula above, although that is
arguable, particularly for effective_io_concurrency=0.

- If the backend has a PinLimit() that won't allow io_combine_limit *
effective_io_concurrency buffers to undergo IO, it doesn't make sense to
make the ring bigger. At best it would waste space for the ring, at worst
it'd make "ring escapes" inevitable - victim buffer search would always
replace buffers that we have in the ring.

- the multiplication by (BLCKSZ / 1024) that I omitted above is actually
included :)
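
To make that concrete: with, say, io_combine_limit = 16 and
effective_io_concurrency = 16 (values picked just for illustration), the
attached patch computes

  ring_size_kb = 256 + (BLCKSZ / 1024) * io_combine_limit * effective_io_concurrency
               = 256 + 8 * 16 * 16
               = 2304kB (~2.25MB)

and then clamps that to Max(256kB, GetPinLimit() * (BLCKSZ / 1024)), so a small
shared_buffers can't end up dominated by a single ring.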

I unfortunately think we do need *something* to address $subject for 18 - the
performance regression when increasing relation sizes is otherwise just too
big - it's trivial to find queries getting slower by more than 4x on local,
low-latency NVMe storage, and on network storage the regression will often be
bigger.

If we don't do something for 18, the only consolation would be that the
performance when using the 256kB BAS_BULKREAD is rather close to the
performance one gets in 17, with or without a strategy. But I don't think that
would make it less surprising that once your table grows sufficiently to use a
strategy, your IO throughput craters.

I have a local prototype for the 17/18 "strategy escape" issue, will work on
polishing that soon, unless you have something for that, Thomas?

Greetings,

Andres Freund

Attachments:

v2-0001-Increase-BAS_BULKREAD-based-on-effective_io_concu.patch (text/x-diff; charset=us-ascii)
From e49f8201413b3b143291ec97d4bea4b1015bcd13 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 18 Mar 2025 14:40:06 -0400
Subject: [PATCH v2] Increase BAS_BULKREAD based on effective_io_concurrency

Before, BAS_BULKREAD was always of size 256kB. With the default
io_combine_limit of 16, that only allowed 1-2 IOs to be in flight -
insufficient even on very low latency storage.

We don't just want to increase the size to a much larger hardcoded value, as
very large rings (10s of MBs of buffers) appear to have negative performance
effects when reading in data that the OS has cached.

To address this, increase the size of BAS_BULKREAD to allow for
io_combine_limit * effective_io_concurrency buffers getting read in. To
prevent the ring being much larger than useful, limit the increased size with
GetPinLimit().

The formula outlined above keeps the ring size to sizes for which we have not
observed performance regressions, unless very large effective_io_concurrency
values are used together with a large shared_buffers setting.

Discussion: https://postgr.es/m/lqwghabtu2ak4wknzycufqjm5ijnxhb4k73vzphlt2a3wsemcd@gtftg44kdim6
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
---
 src/backend/storage/buffer/freelist.c | 46 +++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 336715b6c63..e1f8e5e97bd 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -555,8 +555,50 @@ GetAccessStrategy(BufferAccessStrategyType btype)
 			return NULL;
 
 		case BAS_BULKREAD:
-			ring_size_kb = 256;
-			break;
+			{
+				int			ring_max_kb;
+
+				/*
+				 * The ring always needs to be large enough to allow some
+				 * separation in time between providing a buffer to the user
+				 * of the strategy and that buffer being reused. Otherwise the
+				 * user's pin will prevent reuse of the buffer, even without
+				 * concurrent activity.
+				 *
+				 * We also need to ensure the ring always is large enough for
+				 * SYNC_SCAN_REPORT_INTERVAL, as noted above.
+				 *
+				 * Thus we start out a minimal size and increase the size
+				 * further if appropriate.
+				 */
+				ring_size_kb = 256;
+
+				/*
+				 * There's no point in a larger ring if we won't be allowed to
+				 * pin sufficiently many buffers.  But we never limit to less
+				 * than the minimal size above.
+				 */
+				ring_max_kb = GetPinLimit() * (BLCKSZ / 1024);
+				ring_max_kb = Max(ring_size_kb, ring_max_kb);
+
+				/*
+				 * We would like the ring to additionally have space for the
+				 * configured degree of IO concurrency. While being read
+				 * in, buffers can obviously not yet be reused.
+				 *
+				 * Each IO can be up to io_combine_limit blocks large, and we
+				 * want to start up to effective_io_concurrency IOs.
+				 *
+				 * Note that effective_io_concurrency may be 0, which disables
+				 * AIO.
+				 */
+				ring_size_kb += (BLCKSZ / 1024) *
+					io_combine_limit * effective_io_concurrency;
+
+				if (ring_size_kb > ring_max_kb)
+					ring_size_kb = ring_max_kb;
+				break;
+			}
 		case BAS_BULKWRITE:
 			ring_size_kb = 16 * 1024;
 			break;
-- 
2.48.1.76.g4e746b1a31.dirty

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#4)
Re: BAS_BULKREAD vs read stream

On Tue, Apr 8, 2025 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:

In the attached I implemented the above idea, with some small additional
refinements:

LGTM.

How I wish EXPLAIN would show some clues about strategies...

#6 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#5)
Re: BAS_BULKREAD vs read stream

Hi,

On 2025-04-08 18:11:04 +1200, Thomas Munro wrote:

On Tue, Apr 8, 2025 at 2:20 PM Andres Freund <andres@anarazel.de> wrote:

In the attached I implemented the above idea, with some small additional
refinements:

LGTM.

Thanks for checking.

How I wish EXPLAIN would show some clues about strategies...

Indeed. There will be some interesting piercing of layers to make that work...

Greetings,

Andres Freund

#7 Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Andres Freund (#1)
Re: BAS_BULKREAD vs read stream

On Sun, Apr 6, 2025 at 10:15 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

[..]

The obvious solution to that would be to increase BAS_BULKREAD substantially
above 256kB.

For quite a while I was worried about increasing the size, because somewhere (I
couldn't find it while writing this email, will add the reference once I've
found it again) we have a comment explaining that a small size was chosen
because it helps with CPU cache efficiency.

Hi, FWIW, I was trying to understand the scope of this change, and
GetAccessStrategy() actually points to src/backend/storage/buffer/README,
which explains the rationale behind the old (now-superseded) value. It says
```
For sequential scans, a 256KB ring is used. That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient.
```

-J.