Streaming read-ready sequential scan code
Hi,
Last year, David and I worked on a round of refactoring for
heapgettup() and heapgettup_pagemode() [1]. Now that the streaming
read API has been proposed [2], there is a bit more refactoring that
can be done on master to prepare sequential scan to support streaming
reads.
Patches 0001 and 0002 in the attached patchset do this new round of
refactoring. 0003 is the remainder of the streaming read API that is
not yet in master. 0004 is the sequential scan streaming read user.
The primary change needed to be able to drop in streaming read support
is that heapgettup() and heapgettup_pagemode() have to wait until there
are no more valid buffers, instead of until there are no more valid
BlockNumbers, to know that the relation has been entirely processed.
Streaming reads naturally prefetch ahead of the block currently being
processed by the scan, so all blocks will have been requested long
before all blocks have been processed.
To change this, I split up heapgetpage() into two functions -- one
responsible for getting blocks into buffers and the other for
processing a page (pruning, checking tuple visibility, etc). As a
consequence, I had to change the other caller of heapgetpage() (sample
scans). Since I was doing this anyway, I made a few changes there. It
is arguable that those changes could be split up differently between
0001 and 0004. However, I wanted 0004 to be *only* the sequential scan
streaming read user code.
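The shape of that split can be sketched in miniature (hypothetical names and a toy "stream"; the real code deals in Buffers, BlockNumbers, and per-page visibility work):

```c
#include <assert.h>

/*
 * Toy sketch of the split described above: hypothetical names, not the
 * actual PostgreSQL code. One function is responsible for getting the
 * next block into a buffer; the other processes the page. The scan
 * loop stops when there are no more valid buffers, not when the block
 * numbers run out.
 */

#define INVALID_BUFFER (-1)
#define NBLOCKS 4

/* Stand-in for "get the next block into a buffer". */
static int
fetch_next_buffer(int *next_block)
{
	if (*next_block >= NBLOCKS)
		return INVALID_BUFFER;	/* stream exhausted: scan is done */
	return (*next_block)++;		/* pretend block number == buffer id */
}

/* Stand-in for per-page processing (pruning, visibility checks, ...). */
static int
process_page(int buffer)
{
	return buffer * 10;			/* pretend: tuples found on the page */
}

static int
scan_all(void)
{
	int			next_block = 0;
	int			buffer;
	int			total = 0;

	/* terminate on "no more valid buffers", not "no more blocks" */
	while ((buffer = fetch_next_buffer(&next_block)) != INVALID_BUFFER)
		total += process_page(buffer);
	return total;
}
```

With a streaming read producing the buffers, only the fetch half needs to change; the processing half is untouched.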
There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans (see TODO in 0004).
However, I thought I would keep this thread focused on 0001 and 0002.
Though logically the performance with 0001 and 0002 should be the same
as master (no new non-inline function calls, no additional looping),
I've done a bit of profiling anyway. I created a large multi-GB table,
read it all into shared buffers (disabling the large sequential scan
bulkread optimization), and did a sequential SELECT count(*) from the
table. From the profiles below, you'll notice that master and the
patch are basically the same. Actual percentages vary from run-to-run.
Execution time is the same.
patch
15.49% postgres postgres [.] ExecInterpExpr
11.03% postgres postgres [.] heapgettup_pagemode
10.85% postgres postgres [.] ExecStoreBufferHeapTuple
9.14% postgres postgres [.] heap_getnextslot
8.39% postgres postgres [.] heapbuildvis
6.47% postgres postgres [.] SeqNext
master
14.16% postgres postgres [.] ExecInterpExpr
11.54% postgres postgres [.] heapgettup_pagemode
10.63% postgres postgres [.] ExecStoreBufferHeapTuple
10.22% postgres postgres [.] heap_getnextslot
8.53% postgres postgres [.] heapgetpage
5.35% postgres postgres [.] SeqNext
- Melanie
[1]: /messages/by-id/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
[2]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
Attachments:
- v1-0001-Split-heapgetpage-into-two-parts.patch (+74/-45)
- v1-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (+33/-43)
- v1-0003-Streaming-Read-API.patch (+986/-239)
- v1-0004-Sequential-scans-support-streaming-read.patch (+100/-10)
On Tue, 30 Jan 2024 at 10:17, Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Though logically the performance with 0001 and 0002 should be the same
> as master (no new non-inline function calls, no additional looping),
> I've done a bit of profiling anyway. I created a large multi-GB table,
> read it all into shared buffers (disabling the large sequential scan
> bulkread optimization), and did a sequential SELECT count(*) from the
> table. From the profiles below, you'll notice that master and the
> patch are basically the same. Actual percentages vary from run-to-run.
> Execution time is the same.
Can you also run a test on a Seqscan with a filter that filters out
all tuples? There's less overhead in other parts of the executor with
such a query.
David
On Mon, Jan 29, 2024 at 4:24 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Can you also run a test on a Seqscan with a filter that filters out
> all tuples? There's less overhead in other parts of the executor with
> such a query.
Yes, of course. Thank you so much for taking a look!
While I was at it, I changed the table schema to be entirely composed
of INT type columns and regenerated the data. Note that, both in this
example and my previous example, I ensured that the table was vacuumed
beforehand (and autovacuum disabled for the table) so there wasn't any
on-access pruning happening (heapgetpage() does that in pagemode).
This is the schema:

CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT,
g INT) with (autovacuum_enabled=false);
I added 46000000 rows to the table, making it 2.6 GB. Shared buffers
is double that. Before profiling, I did a SELECT * from the table with
the large sequential scan bulkread optimization disabled. Then I
vacuumed the table. Finally, I turned up parallel_setup_cost high
enough to disable query parallelism.
The query I profiled was:
SELECT * FROM foo WHERE id = 0;
With the data I generated, 0 rows match that condition.
Profiles below. Execution time essentially the same.
patch:
17.08% postgres postgres [.] ExecInterpExpr
11.17% postgres postgres [.] tts_buffer_heap_getsomeattrs
10.64% postgres postgres [.] ExecStoreBufferHeapTuple
9.82% postgres postgres [.] heap_getnextslot
9.13% postgres postgres [.] heapgettup_pagemode
8.98% postgres postgres [.] heapbuildvis
5.40% postgres postgres [.] HeapCheckForSerializableConflictOut
5.16% postgres postgres [.] SeqNext
master:
17.89% postgres postgres [.] ExecInterpExpr
12.28% postgres postgres [.] tts_buffer_heap_getsomeattrs
10.54% postgres postgres [.] ExecStoreBufferHeapTuple
10.11% postgres postgres [.] heapgettup_pagemode
8.52% postgres postgres [.] heapgetpage
8.28% postgres postgres [.] heap_getnextslot
5.00% postgres postgres [.] HeapCheckForSerializableConflictOut
4.71% postgres postgres [.] SeqNext
- Melanie
On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> There is an outstanding question about where to allocate the
> PgStreamingRead object for sequential scans
I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.
Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API
Option B) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only
- Allocates the streaming read object in heapgettup[_pagemode]() when
it has not been previously allocated. To do this it has to record and
switch into a different memory context than the per-tuple context. It
only allocates the streaming read object if it is a forwards scan. It
frees the streaming read object if the scan direction is later
changed.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API
Option C) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream
- Allocates the streaming read object in heapgettup[_pagemode]() when
it has not been previously allocated. To do this it has to record and
switch into a different memory context than the per-tuple context.
- All scan directions support streaming. To do this, the scan
direction has to be tracked and we must check if the direction has
changed on every heapgettup[_pagemode]() invocation to avoid returning
wrong results.
- No "fallback" codepath as all heap sequential scans will use the
streaming read API
As you can see, each option has pros and cons. I'm interested in what
others think about which we should choose.
- Melanie
On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.
It's weird to me that the prospect of changing the scan direction
causes such complexity. I mean, why doesn't a streaming read object
have a forget_all_my_previous_requests() method or somesuch?
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Feb 20, 2024 at 6:13 AM Robert Haas <robertmhaas@gmail.com> wrote:
> It's weird to me that the prospect of changing the scan direction
> causes such complexity. I mean, why doesn't a streaming read object
> have a forget_all_my_previous_requests() method or somesuch?
Basically, that is what pg_streaming_read_free() does. It goes through
and releases the buffers it had pinned and frees any per-buffer data
allocated.
The complexity with the sequential scan streaming read user and scan
direction is just that it has to detect when the scan direction
changes and do the releasing/freeing and reallocation. The scan
direction is passed to heapgettup[_pagemode](), so this is something
that can change on a tuple-to-tuple basis.
It is less that doing this is complicated and more that it is annoying
and distracting to have to check for and handle a very unimportant and
uncommon case in the main path of the common case.
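The annoying check being described might look roughly like this (a standalone sketch with invented names; the real code compares ScanDirection values and resets the actual streaming read object):

```c
#include <assert.h>

/* Invented names; a rough standalone model of the direction check. */
typedef enum { BackwardDir = -1, ForwardDir = 1 } Dir;

typedef struct Stream
{
	int			resets;		/* times we threw away queued-up read-ahead */
	int			next_block;	/* next block the stream would hand back */
} Stream;

typedef struct Scan
{
	Stream		stream;
	Dir			last_dir;
} Scan;

/* Forget queued reads and restart the stream at a given block. */
static void
stream_reset(Stream *stream, int restart_at)
{
	stream->resets++;
	stream->next_block = restart_at;
}

/*
 * Called only when a new page is needed, so the direction comparison
 * is paid per page, not per tuple.
 */
static int
next_page(Scan *scan, Dir dir, int current_block)
{
	int			block;

	if (dir != scan->last_dir)
	{
		/* direction changed: everything queued up is now wrong */
		stream_reset(&scan->stream, current_block + (int) dir);
		scan->last_dir = dir;
	}
	block = scan->stream.next_block;
	scan->stream.next_block += (int) dir;
	return block;
}
```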
- Melanie
On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.
>
> Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
Attached is a version of this patch which implements a "reset"
function for the streaming read API which should be cheaper than the
full pg_streaming_read_free() on rescan. This can easily be ported to
work on any of my proposed implementations (A/B/C). I implemented it
on A as an example.
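The reset-versus-free distinction can be modeled in a few lines (a hypothetical structure; the real pg_streaming_read object also releases buffer pins and per-buffer data):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical model of the reset-versus-free tradeoff. Freeing
 * releases pins and deallocates everything, so a rescan must allocate
 * again; resetting only drops the pins and rewinds, reusing the
 * allocations.
 */
typedef struct Read
{
	int		   *pinned;			/* ids of currently pinned buffers */
	int			npinned;
	int			capacity;
	int			allocations;	/* how many allocations building this cost */
} Read;

static Read *
read_alloc(int capacity)
{
	Read	   *r = calloc(1, sizeof(Read));

	r->pinned = malloc(capacity * sizeof(int));
	r->capacity = capacity;
	r->allocations = 2;
	return r;
}

static void
read_pin(Read *r, int buf)
{
	r->pinned[r->npinned++] = buf;
}

/* Cheap: drop the pins, keep the memory for the next use (rescan). */
static void
read_reset(Read *r)
{
	r->npinned = 0;
}

/* Expensive path: drop the pins and give all the memory back. */
static void
read_free(Read *r)
{
	free(r->pinned);
	free(r);
}
```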
- Melanie
Attachments:
- v2-0001-Split-heapgetpage-into-two-parts.patch (+72/-45)
- v2-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (+35/-45)
- v2-0003-Streaming-Read-API.patch (+986/-239)
- v2-0004-Add-pg_streaming_read_reset.patch (+19/-1)
- v2-0005-Sequential-scans-support-streaming-read.patch (+97/-10)
On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> Attached is a version of this patch which implements a "reset"
> function for the streaming read API which should be cheaper than the
> full pg_streaming_read_free() on rescan. This can easily be ported to
> work on any of my proposed implementations (A/B/C). I implemented it
> on A as an example.
Attached is the latest version of this patchset -- rebased in light of
Thomas' updates to the streaming read API [1]. I have chosen the
approach I think we should go with. It is a hybrid of my previously
proposed approaches.
The streaming read is allocated in heap_beginscan() and then reset on
rescan and when the scan direction changes. I only check if the scan
direction changes when a new page is needed. This implementation means
no fallback method is needed, so we can remove the non-streaming read
code for heap sequential scans.
Because heapgettup() and heapgettup_pagemode() are also used for TID
range scans, this patch also happens to implement streaming reads for
TID range scans.
- Melanie
[1]: /messages/by-id/CA+hUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ@mail.gmail.com
Attachments:
- v3-0001-Split-heapgetpage-into-two-parts.patch (+72/-45)
- v3-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (+35/-45)
- v3-0003-Streaming-Read-API.patch (+1218/-212)
- v3-0004-Add-pg_streaming_read_reset.patch (+19/-1)
- v3-0005-Sequential-scans-and-TID-range-scans-stream-reads.patch (+90/-11)
On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Attached is the latest version of this patchset -- rebased in light of
> Thomas' updates to the streaming read API [1]. I have chosen the
> approach I think we should go with. It is a hybrid of my previously
> proposed approaches.
While investigating some performance concerns, Andres pointed out that
the members I add to HeapScanDescData in this patch push rs_cindex and
rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
v4's HeapScanDescData is as well-packed as on master and its members
are reordered so that rs_cindex and rs_ntuples are back on the second
cacheline of the struct's data.
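The padding problem being described is easy to reproduce with toy structs (these are illustrative, not the actual HeapScanDescData; on common 64-bit ABIs a 4-byte member between 8-byte members leaves a 4-byte hole and pushes later members toward, or across, the next 64-byte cacheline boundary):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative structs, not the actual HeapScanDescData. A lone 4-byte
 * member between 8-byte members forces 4 bytes of padding, and every
 * later member is shifted 4 bytes further into the struct.
 */
struct holey
{
	uint64_t	a;
	uint32_t	flag;		/* 4-byte hole follows, to realign 'b' */
	uint64_t	b;
	uint32_t	rs_cindex;	/* lands at offset 24 */
	uint32_t	rs_ntuples;
};

/* Reordered: 8-byte members first, 4-byte members together at the end. */
struct packed
{
	uint64_t	a;
	uint64_t	b;
	uint32_t	flag;
	uint32_t	rs_cindex;	/* lands at offset 20 -- no hole anywhere */
	uint32_t	rs_ntuples;
};
```

Tools like pahole make these holes visible directly on the compiled binary.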
- Melanie
Attachments:
- v4-0001-Split-heapgetpage-into-two-parts.patch (+72/-45)
- v4-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (+35/-45)
- v4-0003-Streaming-Read-API.patch (+1218/-212)
- v4-0004-Add-pg_streaming_read_reset.patch (+19/-1)
- v4-0005-Sequential-scans-and-TID-range-scans-stream-reads.patch (+93/-12)
On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:
> While investigating some performance concerns, Andres pointed out that
> the members I add to HeapScanDescData in this patch push rs_cindex and
> rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
> v4's HeapScanDescData is as well-packed as on master and its members
> are reordered so that rs_cindex and rs_ntuples are back on the second
> cacheline of the struct's data.
I did some additional profiling and realized that dropping the
unlikely() from the places we check rs_inited frequently was negatively
impacting performance. v5 adds those back and also makes a few other
very minor cleanups.
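For context, unlikely() in PostgreSQL wraps GCC/Clang's __builtin_expect: it does not change the branch's result, only the compiler's code layout, keeping a rare path (such as the one-time rs_inited initialization) out of the hot instruction stream. A standalone sketch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified from PostgreSQL's c.h: unlikely() wraps GCC/Clang's
 * __builtin_expect. The branch's result is unchanged; only the
 * compiler's block layout is steered.
 */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) ((x) != 0)
#endif

static int	one_time_inits = 0;	/* to observe the cold path below */

/* Mirrors the rs_inited pattern: setup runs once, then never again. */
static int
get_next(bool *inited)
{
	if (unlikely(!*inited))
	{
		*inited = true;		/* one-time setup: the cold path */
		one_time_inits++;
	}
	return 42;				/* the common, hot path */
}
```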
Note that this patch set has a not yet released version of Thomas
Munro's Streaming Read API with a new ramp-up logic which seems to fix a
performance issue I saw with my test case when all of the sequential
scan's blocks are in shared buffers. Once he sends the official new
version, I will rebase this and point to his explanation in that thread.
- Melanie
Attachments:
- v5-0001-Split-heapgetpage-into-two-parts.patch (+72/-45)
- v5-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (+35/-45)
- v5-0003-fixup-Streaming-Read-API.patch (+1217/-212)
- v5-0004-Add-pg_streaming_read_reset.patch (+19/-1)
- v5-0005-Sequential-scans-and-TID-range-scans-stream-reads.patch (+93/-12)
On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I did some additional profiling and realized that dropping the
> unlikely() from the places we check rs_inited frequently was negatively
> impacting performance. v5 adds those back and also makes a few other
> very minor cleanups.
>
> Note that this patch set has a not yet released version of Thomas
> Munro's Streaming Read API with a new ramp-up logic which seems to fix a
> performance issue I saw with my test case when all of the sequential
> scan's blocks are in shared buffers. Once he sends the official new
> version, I will rebase this and point to his explanation in that thread.
Attached v6 has the version of the streaming read API mentioned here
[1]. This resolved the fully-in-shared-buffers regressions
investigated in that thread by Andres, Bilal, and Thomas.
The one outstanding item for the sequential scan streaming read user
is deciding how the BAS_BULKREAD buffer access strategy should
interact with the streaming read infrastructure. We discussed a bit
off-list, and it seems clear that the ring must be at least as large
as io_combine_limit. This should be no problem for BAS_BULKREAD
because its ring is 16 MB. The question is whether or not we need to
do anything right now to ensure there aren't adverse interactions
between io_combine_limit, max_ios, and the buffer access strategy ring
buffer size.
- Melanie
[1]: /messages/by-id/CA+hUKGJTwrS7F=uJPx3SeigMiQiW+LJaOkjGyZdCntwyMR=uAw@mail.gmail.com
Attachments:
- v6-0001-Split-heapgetpage-into-two-parts.patch (+72/-45)
- v6-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (+35/-45)
- v6-0003-Provide-vectored-variant-of-ReadBuffer.patch (+563/-223)
- v6-0004-Provide-API-for-streaming-relation-data.patch (+829/-12)
- v6-0005-Add-read_stream_reset.patch (+20/-1)
- v6-0006-Sequential-scans-and-TID-range-scans-stream-reads.patch (+93/-11)
On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote:
> Attached v6 has the version of the streaming read API mentioned here
> [1]. This resolved the fully-in-shared-buffers regressions
> investigated in that thread by Andres, Bilal, and Thomas.
Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.
I noticed that 0001 in the set introduced a small regression relative
to master for a sequential scan of a relation already in shared
buffers. While investigating this, I saw that heapfetchbuf() was still
not being inlined (compiled at -O2), and when I promoted heapfetchbuf()
from static inline to static pg_attribute_always_inline, most of the
very small regression I saw went away. I don't know if I squashed the
issue entirely, though.
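pg_attribute_always_inline expands (per PostgreSQL's c.h) to __attribute__((always_inline)) inline on GCC/Clang and __forceinline on MSVC, overriding the compiler's -O2 inlining heuristics. A minimal sketch of the pattern, with a stand-in for heapfetchbuf():

```c
#include <assert.h>

/* Simplified from PostgreSQL's c.h. */
#if defined(__GNUC__)
#define pg_attribute_always_inline __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define pg_attribute_always_inline __forceinline
#else
#define pg_attribute_always_inline inline
#endif

/*
 * Stand-in for heapfetchbuf(): tiny and called once per page in a hot
 * loop, so a declined inline shows up as call overhead in profiles.
 */
static pg_attribute_always_inline int
fetch_buf(int block)
{
	return block + 1;		/* pretend: block number -> buffer id */
}

static int
scan(int nblocks)
{
	int			sum = 0;

	for (int b = 0; b < nblocks; b++)
		sum += fetch_buf(b);
	return sum;
}
```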
- Melanie
Attachments:
- v7-0001-Split-heapgetpage-into-two-parts.patch (+72/-45)
- v7-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (+35/-45)
- v7-0003-v14-Streaming-Read-API.patch (+1484/-234)
- v7-0004-Sequential-scans-and-TID-range-scans-stream-reads.patch (+95/-11)
On 01/04/2024 22:58, Melanie Plageman wrote:
> Attached v7 has version 14 of the streaming read API as well as a few
> small tweaks to comments and code.
I saw benchmarks in this thread to show that there's no regression when
the data is in cache, but I didn't see any benchmarks demonstrating the
benefit of this. So I ran this quick test:
-- create a ~1 GB table with only 1 row per page
CREATE TABLE giga (i int, filler text) with (fillfactor=10);
insert into giga select g, repeat('x', 900)
from generate_series(1, 140000) g;
vacuum freeze giga;
\timing on
select count(*) from giga;
The SELECT takes about 390 ms on 'master', and 230 ms with the patch.
This is pretty much the best case for this patch, real world gains will
be much smaller. Nevertheless, nice speedup!
--
Heikki Linnakangas
Neon (https://neon.tech)
On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 01/04/2024 22:58, Melanie Plageman wrote:
Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.
I saw benchmarks in this thread to show that there's no regression when
the data is in cache, but I didn't see any benchmarks demonstrating the
benefit of this. So I ran this quick test:
Good point! It would be good to show why we would actually want this
patch. Attached v8 is rebased over current master (which now has the
streaming read API).
On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.
Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.
- Melanie
Attachments:
v8-0003-Sequential-scans-and-TID-range-scans-stream-reads.patch (text/x-patch, +97 -13)
v8-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (text/x-patch, +35 -45)
v8-0001-Split-heapgetpage-into-two-parts.patch (text/x-patch, +72 -45)
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 01/04/2024 22:58, Melanie Plageman wrote:
Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.
I saw benchmarks in this thread to show that there's no regression when
the data is in cache, but I didn't see any benchmarks demonstrating the
benefit of this. So I ran this quick test:
Good point! It would be good to show why we would actually want this
patch. Attached v8 is rebased over current master (which now has the
streaming read API).
Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE. That's great, but it doesn't seem to be
due to intentional changes. No efficiency is coming from batching
anything. Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out? (Makes me
wonder what happens if you insert a memory prefetch for the page
header into that production line, and if there are more opportunities
to spread dependencies out, e.g. hashing the buffer tag ahead of time.)
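The parenthetical idea above can be sketched roughly like this, assuming a GCC/Clang toolchain for the __builtin_prefetch builtin; the struct and function names are illustrative stand-ins, not the real read stream API or page layout:

```c
#include <assert.h>

/* Simplified stand-in for a heap page header; only two fields matter
 * for this sketch. */
typedef struct PageHeaderSketch
{
    unsigned short pd_lower;
    unsigned short pd_upper;
} PageHeaderSketch;

static void
prefetch_page_header(const void *page)
{
    /* Hint the CPU to pull in the header cache line early: read-only
     * access (0), high temporal locality (3). A no-op on compilers
     * without the builtin. */
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(page, 0, 3);
#endif
}

/* The stream hands back page n while page n + 1 is already pinned, so
 * we can warm n + 1's header while still working on n. */
static unsigned short
process_page(const PageHeaderSketch *page, const PageHeaderSketch *next_page)
{
    prefetch_page_header(next_page);
    return page->pd_upper - page->pd_lower;    /* free space on this page */
}

static PageHeaderSketch sketch_pages[2] = {{24, 8192}, {0, 0}};
```

Whether the prefetch actually helps would of course need measurement; the point is only where in the production line such a hint could be inserted.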
On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.
Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.
Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:
if (strategy)
{
    int strategy_buffers = GetAccessStrategyBufferCount(strategy);

    max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}
I just don't know where to get that '2'. The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.
Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true. The size of main
memory and L2 cache have increased dramatically since those strategies
were invented. I think we should at least double them, and more
likely quadruple them. I realise you already made them configurable
per command in commit 1cbbee033, but I mean the hard coded default 256
in freelist.c. It's not only to get more space for our prefetching
plans, it's also to give the system more chance of flushing WAL
asynchronously/in some other backend before you crash into dirty data;
as you discovered, prefetching accidentally makes that effect worse
in a BAS_VACUUM strategy, by taking away space that is effectively
deferring WAL flushes, so I'd at least like to double the size for if
we use "/ 2" above.
On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.
Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.
Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:
if (strategy)
{
    int strategy_buffers = GetAccessStrategyBufferCount(strategy);

    max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}
I just don't know where to get that '2'. The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.
Yea, I don't actually understand why not just use strategy_buffers - 1
or something. 1/2 seems like a big limiting factor for those
strategies with small rings.
I don't really think it will come up for SELECT queries since they
rely on readahead and not prefetching.
It does seem like it could easily come up for analyze.
But I am on board with the idea of clamping for sequential scan/TID
range scan. For vacuum, we might have to think harder. If the user
specifies buffer_usage_limit and io_combine_limit and they are never
reaching IOs of io_combine_limit because of their buffer_usage_limit
value, then we should probably inform them.
- Melanie
Hi,
On 2024-04-04 09:27:35 +1300, Thomas Munro wrote:
Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE. That's great, but it doesn't seem to be
due to intentional changes. No efficiency is coming from batching
anything. Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out?
Another theory is that it's due to the plain ReadBuffer() path needing to do
RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't
need to.
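That theory can be modelled in a self-contained way; the types and RelationGetSmgr() below are simplified stand-ins, not the real definitions, and the counter only makes the per-call revalidation visible:

```c
#include <assert.h>

/* Simplified stand-ins for PostgreSQL's relation/smgr types. */
typedef struct SMgrRelationData { int unused; } SMgrRelationData, *SMgrRelation;
typedef struct RelationData { SMgrRelation rd_smgr; } RelationData, *Relation;

static int smgr_lookups = 0;

/* In the real function this revalidates rd_smgr each time; here we just
 * count how often it runs. */
static SMgrRelation
RelationGetSmgr(Relation reln)
{
    smgr_lookups++;
    return reln->rd_smgr;
}

/* Plain ReadBuffer()-style path: one lookup per block read. */
static int
count_lookups_plain(Relation reln, int nblocks)
{
    smgr_lookups = 0;
    for (int i = 0; i < nblocks; i++)
        (void) RelationGetSmgr(reln);   /* re-resolved for every block */
    return smgr_lookups;
}

/* Streaming path: the stream can capture the smgr once up front. */
static int
count_lookups_stream(Relation reln, int nblocks)
{
    SMgrRelation smgr;

    smgr_lookups = 0;
    smgr = RelationGetSmgr(reln);       /* captured once */
    for (int i = 0; i < nblocks; i++)
        (void) smgr;                    /* imagine the block read here */
    return smgr_lookups;
}

static SMgrRelationData sketch_smgr;
static RelationData sketch_rel = {&sketch_smgr};
```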
On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.
Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);
Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.
Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:
if (strategy)
{
    int strategy_buffers = GetAccessStrategyBufferCount(strategy);

    max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}
I just don't know where to get that '2'. The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.
The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
should we apply that only if the strategy doesn't use the
StrategyRejectBuffer() logic?
I think it's fine to add a handwavy justification for the 2, saying that we
want to balance readahead distance and reducing WAL write frequency, and that
at some point more sophisticated logic will be needed.
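Written out, the clamp with that handwavy justification could look like the sketch below; Min, GetAccessStrategyBufferCount, and the strategy struct are simplified stand-ins so the fragment is self-contained, and the 256 in the test mirrors the hard-coded freelist.c default mentioned later in the thread:

```c
#include <assert.h>

/* Simplified stand-in for a buffer access strategy ring. */
typedef struct BufferAccessStrategyData
{
    int nbuffers;               /* ring size in buffers */
} BufferAccessStrategyData, *BufferAccessStrategy;

#define Min(x, y) ((x) < (y) ? (x) : (y))

static int
GetAccessStrategyBufferCount(BufferAccessStrategy strategy)
{
    return strategy->nbuffers;
}

static int
clamp_pinned_buffers(BufferAccessStrategy strategy, int max_pinned_buffers)
{
    if (strategy)
    {
        int strategy_buffers = GetAccessStrategyBufferCount(strategy);

        /*
         * Never pin more than half the ring, balancing read-ahead
         * distance against WAL flush frequency. The "2" is admittedly
         * handwavy; more sophisticated logic may be needed eventually.
         */
        max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
    }
    return max_pinned_buffers;
}

static BufferAccessStrategyData bulkread_ring = {256};  /* freelist.c default */
```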
Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true.
I'm not sure it's right tying them together. The concerns for both are fairly
different.
The size of main memory and L2 cache have increased dramatically since those
strategies were invented. I think we should at least double them, and more
likely quadruple them. I realise you already made them configurable per
command in commit 1cbbee033, but I mean the hard coded default 256 in
freelist.c. It's not only to get more space for our prefetching plans, it's
also to give the system more chance of flushing WAL asynchronously/in some
other backend before you crash into dirty data; as you discovered,
prefetching accidentally makes that effect worse in a BAS_VACUUM strategy,
by taking away space that is effectively deferring WAL flushes, so I'd at
least like to double the size for if we use "/ 2" above.
I think for VACUUM we should probably go a bit further. There's no comparable
L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
expensive, compared to a seqscan. I'd go for at least 4x-8x.
Greetings,
Andres Freund
On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote:
Attached v8 is rebased over current master (which now has the
streaming read API).
I've looked at the v8-0001 patch.
I wasn't too keen on heapbuildvis() as a function name for an external
function. Since it also does pruning work, it seemed weird to make it
sound like it only did visibility work. Per our offline discussion
about names, I've changed it to what you suggested which is
heap_page_prep().
Aside from that, there was an outdated comment: "In pageatatime mode,
heapgetpage() already did visibility checks," which was no longer true
as that's done in heapbuildvis() (now heap_page_prep()).
I also did a round of comment adjustments as there were a few things I
didn't like, e.g:
+ * heapfetchbuf - subroutine for heapgettup()
This is also used in heapgettup_pagemode(), so I thought it was a bad
idea for a function to list places it thinks it's being called. I
also thought it redundant to write "This routine" in the function head
comment. I think "this routine" is implied by the context. I ended up
with:
/*
* heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
*
* Read the specified block of the scan relation into a buffer and pin that
* buffer before saving it in the scan descriptor.
*/
I'm happy with your changes to heapam_scan_sample_next_block(). I did
adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
the same as the seqscan version, just with "seqscan" swapped for
"sample scan".
The only other thing I adjusted there was to use "blockno" in some
places where you were using hscan->rs_cblock. These all come after
the "hscan->rs_cblock = blockno;" line. My thoughts here are that this
is more likely to avoid reading the value from the struct again if the
compiler isn't happy that the two values are still equivalent for some
reason. Even if the compiler is happy today, it would only take a
code change to pass hscan to some external function for the compiler
to perhaps be uncertain if that function has made an adjustment to
rs_cblock and go with the safe option of pulling the value out the
struct again which is a little more expensive as it requires some
maths to figure out the offset.
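A toy model of that aliasing point, with illustrative names (in real code the opaque call would live in another translation unit, so the optimizer genuinely cannot see through it):

```c
#include <assert.h>

typedef unsigned int BlockNumber;

typedef struct HeapScanSketch
{
    BlockNumber rs_cblock;      /* current block # in scan */
} HeapScanSketch;

/* Stands in for a call the compiler cannot analyze; after it returns,
 * the compiler must assume rs_cblock may have changed. */
static void
opaque_call(HeapScanSketch *hscan)
{
    (void) hscan;
}

/* Reading back through the struct may force a reload from memory
 * (base pointer plus field offset). */
static BlockNumber
advance_reload(HeapScanSketch *hscan, BlockNumber blockno)
{
    hscan->rs_cblock = blockno;
    opaque_call(hscan);
    return hscan->rs_cblock;
}

/* Returning the local keeps the value in a register regardless of what
 * the opaque call might have done to the struct. */
static BlockNumber
advance_local(HeapScanSketch *hscan, BlockNumber blockno)
{
    hscan->rs_cblock = blockno;
    opaque_call(hscan);
    return blockno;
}

static HeapScanSketch sketch_scan;
```

Both functions return the same value here; the difference is purely in what code the compiler is allowed to generate.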
I've attached v9-0001 and a delta of just my changes from v8.
David
On Wed, Apr 3, 2024 at 9:08 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote:
Attached v8 is rebased over current master (which now has the
streaming read API).
I've looked at the v8-0001 patch.
Thanks for taking a look!
I wasn't too keen on heapbuildvis() as a function name for an external
function. Since it also does pruning work, it seemed weird to make it
sound like it only did visibility work. Per our offline discussion
about names, I've changed it to what you suggested which is
heap_page_prep().
Looking at it in the code, I am wondering if we should call
heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
that it is prepping a page to be scanned. You choose whatever you
think is best.
Aside from that, there was an outdated comment: "In pageatatime mode,
heapgetpage() already did visibility checks," which was no longer true
as that's done in heapbuildvis() (now heap_page_prep()).
I also did a round of comment adjustments as there were a few things I
didn't like, e.g:
+ * heapfetchbuf - subroutine for heapgettup()
This is also used in heapgettup_pagemode(), so I thought it was a bad
idea for a function to list places it thinks it's being called. I
also thought it redundant to write "This routine" in the function head
comment. I think "this routine" is implied by the context. I ended up
with:/*
* heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
*
* Read the specified block of the scan relation into a buffer and pin that
* buffer before saving it in the scan descriptor.
*/
I'm happy with your changes to heapam_scan_sample_next_block(). I did
adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
the same as the seqscan version, just with "seqscan" swapped for
"sample scan".
That all is fine with me.
The only other thing I adjusted there was to use "blockno" in some
places where you were using hscan->rs_cblock. These all come after
the "hscan->rs_cblock = blockno;" line. My thoughts here are that this
is more likely to avoid reading the value from the struct again if the
compiler isn't happy that the two values are still equivalent for some
reason. Even if the compiler is happy today, it would only take a
code change to pass hscan to some external function for the compiler
to perhaps be uncertain if that function has made an adjustment to
rs_cblock and go with the safe option of pulling the value out the
struct again which is a little more expensive as it requires some
maths to figure out the offset.
I've attached v9-0001 and a delta of just my changes from v8.
All sounds good and LGTM
- Melanie
On Thu, 4 Apr 2024 at 14:38, Melanie Plageman <melanieplageman@gmail.com> wrote:
Looking at it in the code, I am wondering if we should call
heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
that it is prepping a page to be scanned. You choose whatever you
think is best.
I ended up calling it heap_prepare_pagescan() as I started to think
prep/prepare should come first. I don't think it's perfect as the
intended meaning is heap_prepare_page_for_scanning_in_pagemode(), but
that's obviously too long.
I've pushed the v9-0001 with that rename done.
David