optimize lookups in snapshot [sub]xip arrays
Hi hackers,
A few years ago, there was a proposal to create hash tables for long
[sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
I was curious whether this idea still showed measurable benefits, so I
revamped the patch and ran the same test as before [1]. Here are the
results for 60-second runs on an r5d.24xlarge with the data directory on
the local NVMe storage:
writers HEAD patch diff
----------------------------
16 659 664 +1%
32 645 663 +3%
64 659 692 +5%
128 641 716 +12%
256 619 610 -1%
512 530 702 +32%
768 469 582 +24%
1000 367 577 +57%
As before, the hash table approach seems to provide a decent benefit at
higher client counts, so I felt it was worth reviving the idea.
The attached patch has some key differences from the previous proposal.
For example, the new patch uses simplehash instead of open-coding a new
hash table. Also, I've bumped up the threshold for creating hash tables to
128 based on the results of my testing. The attached patch waits until a
lookup of [sub]xip before generating the hash table, so we only need to
allocate enough space for the current elements in the [sub]xip array, and
we avoid allocating extra memory for workloads that do not need the hash
tables. I'm slightly worried about increasing the number of memory
allocations in this code path, but the results above seemed encouraging on
that front.
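In outline, the lazy-build approach can be sketched as follows. This is a hypothetical stand-in, not the patch: the patch uses simplehash, while this sketch uses a minimal open-addressing set (names like XipHash and xip_hash_create are invented for illustration; InvalidTransactionId, i.e. 0, serves as the empty-slot marker):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Minimal open-addressing set of xids; 0 (InvalidTransactionId) marks empty. */
typedef struct XipHash
{
	uint32_t	mask;			/* table size - 1; size is a power of two */
	TransactionId *slots;
} XipHash;

static XipHash *
xip_hash_create(const TransactionId *xip, uint32_t xcnt)
{
	uint32_t	size = 4;
	XipHash    *h = malloc(sizeof(XipHash));

	while (size < xcnt * 2)		/* keep the load factor under 50% */
		size *= 2;
	h->mask = size - 1;
	h->slots = calloc(size, sizeof(TransactionId));
	for (uint32_t i = 0; i < xcnt; i++)
	{
		uint32_t	pos = (xip[i] * 2654435761u) & h->mask;

		while (h->slots[pos] != 0)
			pos = (pos + 1) & h->mask;
		h->slots[pos] = xip[i];
	}
	return h;
}

/*
 * Lazily build the table only when a lookup actually happens, so workloads
 * that never probe [sub]xip pay nothing; below the threshold, stay linear.
 */
static int
XidInXip(TransactionId xid, const TransactionId *xip, uint32_t xcnt,
		 XipHash **xiph)
{
	if (xcnt < 128)				/* cf. XIP_HASH_MIN_ELEMENTS */
	{
		for (uint32_t i = 0; i < xcnt; i++)
			if (xip[i] == xid)
				return 1;
		return 0;
	}
	if (*xiph == NULL)			/* first lookup: build the hash table */
		*xiph = xip_hash_create(xip, xcnt);
	for (uint32_t pos = (xid * 2654435761u) & (*xiph)->mask;
		 (*xiph)->slots[pos] != 0;
		 pos = (pos + 1) & (*xiph)->mask)
		if ((*xiph)->slots[pos] == xid)
			return 1;
	return 0;
}
```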
Thoughts?
[0]: /messages/by-id/35960b8af917e9268881cd8df3f88320@postgrespro.ru
[1]: /messages/by-id/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Optimize-lookups-in-snapshot-transactions-in-prog.patch (text/x-diff)
On Wed, Jul 13, 2022 at 10:40 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
Hi hackers,
A few years ago, there was a proposal to create hash tables for long
[sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
I was curious whether this idea still showed measurable benefits, so I
revamped the patch and ran the same test as before [1]. Here are the
results for 60-second runs on an r5d.24xlarge with the data directory on
the local NVMe storage:
writers HEAD patch diff
----------------------------
16 659 664 +1%
32 645 663 +3%
64 659 692 +5%
128 641 716 +12%
256 619 610 -1%
512 530 702 +32%
768 469 582 +24%
1000 367 577 +57%
Impressive.
As before, the hash table approach seems to provide a decent benefit at
higher client counts, so I felt it was worth reviving the idea.
The attached patch has some key differences from the previous proposal.
For example, the new patch uses simplehash instead of open-coding a new
hash table. Also, I've bumped up the threshold for creating hash tables to
128 based on the results of my testing. The attached patch waits until a
lookup of [sub]xip before generating the hash table, so we only need to
allocate enough space for the current elements in the [sub]xip array, and
we avoid allocating extra memory for workloads that do not need the hash
tables. I'm slightly worried about increasing the number of memory
allocations in this code path, but the results above seemed encouraging on
that front.
Thoughts?
[0]: /messages/by-id/35960b8af917e9268881cd8df3f88320@postgrespro.ru
[1]: /messages/by-id/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com
Aren't these snapshot arrays always sorted? I see the following code:
/* sort so we can bsearch() */
qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
/* sort so we can bsearch() later */
qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
If the ordering isn't an invariant of these snapshot arrays, can we
also use the hash table mechanism for all of the snapshot arrays
infrastructure rather than qsort+bsearch in a few places and hash
table for others?
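For reference, the qsort+bsearch pattern in the quoted code looks roughly like this as a self-contained sketch (xidComparator is reimplemented here with the same plain-uint32 ordering; xid_in_sorted_array is a hypothetical helper, not snapshot code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;

/* Same semantics as PostgreSQL's xidComparator: plain uint32 ordering. */
static int
xidComparator(const void *arg1, const void *arg2)
{
	TransactionId xid1 = *(const TransactionId *) arg1;
	TransactionId xid2 = *(const TransactionId *) arg2;

	if (xid1 > xid2)
		return 1;
	if (xid1 < xid2)
		return -1;
	return 0;
}

/* After a one-time qsort(), each membership probe is O(log n). */
static int
xid_in_sorted_array(TransactionId xid, const TransactionId *xip, size_t xcnt)
{
	return bsearch(&xid, xip, xcnt, sizeof(TransactionId),
				   xidComparator) != NULL;
}
```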
+ * The current value worked well in testing, but it's still mostly a guessed-at
+ * number that might need updating in the future.
+ */
+#define XIP_HASH_MIN_ELEMENTS (128)
+
Do you see a regression with a hash table for all the cases? Why can't
we just build a hash table irrespective of these limits and use it for
all the purposes instead of making it complex with different
approaches if we don't have measurable differences in the performance
or throughput?
+static inline bool
+XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
+ xip_hash_hash **xiph)
+ /* Make sure the hash table is built. */
+ if (*xiph == NULL)
+ {
+ *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
+
+ for (int i = 0; i < xcnt; i++)
Why create a hash table on the first search? Why can't it be built
while inserting or creating these snapshots? Basically, instead of the
array, can these snapshot structures be hash tables by themselves? I
know this requires a good amount of code refactoring, but worth
considering IMO as it removes bsearch thus might improve the
performance further.
Regards,
Bharath Rupireddy.
Hi Bharath,
Thanks for taking a look.
On Thu, Jul 14, 2022 at 03:10:56PM +0530, Bharath Rupireddy wrote:
Aren't these snapshot arrays always sorted? I see the following code:
/* sort so we can bsearch() */
qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
/* sort so we can bsearch() later */
qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);
AFAICT these arrays are sorted in limited cases, such as
pg_current_snapshot() and logical replication. GetSnapshotData() does not
appear to sort them, so I don't think we can always assume they are sorted.
In the previous thread, Tomas analyzed simply sorting the arrays [0] and
found that it provided much less improvement compared to the hash table
approach, so I have not seriously considered it here.
If the ordering isn't an invariant of these snapshot arrays, can we
also use the hash table mechanism for all of the snapshot arrays
infrastructure rather than qsort+bsearch in a few places and hash
table for others?
Unless there is demonstrable benefit in doing so for the few places that
sort the arrays, I'm skeptical it's worth the complexity. This patch is
targeted to XidInMVCCSnapshot(), which we can demonstrate has clear impact
on TPS for some workloads.
+ * The current value worked well in testing, but it's still mostly a guessed-at
+ * number that might need updating in the future.
+ */
+#define XIP_HASH_MIN_ELEMENTS (128)
+
Do you see a regression with a hash table for all the cases? Why can't
we just build a hash table irrespective of these limits and use it for
all the purposes instead of making it complex with different
approaches if we don't have measurable differences in the performance
or throughput?
I performed the same tests as before with a variety of values. Here are
the results:
writers HEAD 1 16 32 64 128
------------------------------------
16 659 698 678 659 665 664
32 645 661 688 657 649 663
64 659 656 653 649 663 692
128 641 636 639 679 643 716
256 619 641 619 643 653 610
512 530 609 582 602 605 702
768 469 610 608 551 571 582
1000 367 610 538 557 556 577
I was surprised to see that there really wasn't a regression at the low end,
but keep in mind that this is a rather large machine and a specialized
workload for generating snapshots with long [sub]xip arrays. That being
said, there really wasn't any improvement at the low end, either. If we
always built a hash table, we'd be introducing more overhead and memory
usage in return for approximately zero benefit. My intent was to only take
on the overhead in cases where we believe it might have a positive impact,
which is why I picked the somewhat conservative value of 128. If the
overhead isn't a concern, it might be feasible to always make [sub]xip a
hash table.
+static inline bool
+XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
+ xip_hash_hash **xiph)
+ /* Make sure the hash table is built. */
+ if (*xiph == NULL)
+ {
+ *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
+
+ for (int i = 0; i < xcnt; i++)
Why create a hash table on the first search? Why can't it be built
while inserting or creating these snapshots? Basically, instead of the
array, can these snapshot structures be hash tables by themselves? I
know this requires a good amount of code refactoring, but worth
considering IMO as it removes bsearch thus might improve the
performance further.
The idea is to avoid the overhead unless something actually needs to
inspect these arrays.
[0]: /messages/by-id/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
Sounds worth pursuing.
On 2022-07-13 10:09:50 -0700, Nathan Bossart wrote:
The attached patch has some key differences from the previous proposal.
For example, the new patch uses simplehash instead of open-coding a new
hash table.
+1
The attached patch waits until a lookup of [sub]xip before generating the
hash table, so we only need to allocate enough space for the current
elements in the [sub]xip array, and we avoid allocating extra memory for
workloads that do not need the hash tables.
Hm. Are there any contexts where we'd not want the potential for failing due
to OOM?
I wonder if we additionally / alternatively could use a faster method of
searching the array linearly, e.g. using SIMD.
Another thing that might be worth looking into is to sort the xip/subxip
arrays into a binary-search optimized layout. That'd make the binary search
faster, wouldn't require additional memory (a boolean indicating whether
sorted somewhere, I guess), and would easily persist across copies of the
snapshot.
I'm slightly worried about increasing the number of memory
allocations in this code path, but the results above seemed encouraging on
that front.
ISTM that the test wouldn't be likely to show those issues.
These hash tables are regarded as ephemeral; they only live in
process-local memory and are never rewritten, copied, or
serialized.
What does rewriting refer to here?
Not convinced that's the right idea in case of copying. I think we often end
up copying snapshots frequently, and building & allocating the hashed xids
separately every time seems not great.
+ snapshot->xiph = NULL;
+ snapshot->subxiph = NULL;
Do we need separate hashes for these? ISTM that if we're overflowed then we
don't need ->subxiph, and if not, then the action for an xid being in ->xip
and ->subxiph is the same?
Greetings,
Andres Freund
Hi Andres,
Thanks for taking a look.
On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
Hm. Are there any contexts where we'd not want the potential for failing due
to OOM?
I'm not sure about this one.
I wonder if we additionally / alternatively could use a faster method of
searching the array linearly, e.g. using SIMD.
I looked into using SIMD. The patch is attached, but it is only intended
for benchmarking purposes and isn't anywhere close to being worth serious
review. There may be a simpler/better way to implement the linear search,
but this seemed to work well. Overall, SIMD provided a decent improvement.
I had to increase the number of writers quite a bit in order to demonstrate
where the hash tables began winning. Here are the numbers:
writers head simd hash
256 663 632 694
512 530 618 637
768 489 544 573
1024 364 508 562
2048 185 306 485
4096 146 197 441
While it is unsurprising that the hash tables perform the best, there are a
couple of advantages to SIMD that might make that approach worth
considering. For one, there's really no overhead (i.e., you don't need to
sort the array or build a hash table), so we can avoid picking an arbitrary
threshold and just have one code path. Also, a SIMD implementation for a
linear search through an array of integers could likely be easily reused
elsewhere.
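For readers following along, the core of a SIMD linear search over uint32 values looks roughly like this. It is a sketch in the spirit of the attached benchmarking patch, not the patch itself; xid_in_array is a hypothetical name, and the function falls back to a scalar loop where SSE2 is unavailable:

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(_M_AMD64)
#include <emmintrin.h>
#define USE_SSE2
#endif

/* Returns nonzero if 'key' appears in the first 'nelem' entries of 'base'. */
static int
xid_in_array(uint32_t key, const uint32_t *base, uint32_t nelem)
{
	uint32_t	i = 0;

#ifdef USE_SSE2
	const __m128i keys = _mm_set1_epi32((int) key);	/* key broadcast 4x */

	for (; i + 4 <= nelem; i += 4)
	{
		__m128i		vals = _mm_loadu_si128((const __m128i *) &base[i]);
		__m128i		cmp = _mm_cmpeq_epi32(keys, vals);	/* lanewise compare */

		if (_mm_movemask_epi8(cmp) != 0)	/* any lane matched? */
			return 1;
	}
#endif
	/* scalar tail, or the whole search on non-SSE2 targets */
	for (; i < nelem; i++)
		if (base[i] == key)
			return 1;
	return 0;
}
```

Note there is no setup cost at all: no sort, no allocation, so a single code path can serve every array length.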
Another thing that might be worth looking into is to sort the xip/subxip
arrays into a binary-search optimized layout. That'd make the binary search
faster, wouldn't require additional memory (a boolean indicating whether
sorted somewhere, I guess), and would easily persist across copies of the
snapshot.
I spent some time looking into this, but I haven't attempted to implement
it. IIUC the most difficult part of this is sorting the array in place to
the special layout.
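For concreteness, here is a sketch of that layout, often called the Eytzinger layout: element k has its children at 2k+1 and 2k+2, so the hot top of the tree sits in a few consecutive cache lines. Building it out-of-place from a sorted array is straightforward (shown below); doing it in place is the difficult part mentioned above. All names here are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill 'out' with the Eytzinger layout of 'sorted' by an in-order walk of
 * the implicit tree: visiting nodes in-order assigns sorted values in order.
 */
static void
eytzinger_build(const uint32_t *sorted, uint32_t *out, size_t n,
				size_t *next, size_t k)
{
	if (k >= n)
		return;
	eytzinger_build(sorted, out, n, next, 2 * k + 1);	/* left subtree */
	out[k] = sorted[(*next)++];
	eytzinger_build(sorted, out, n, next, 2 * k + 2);	/* right subtree */
}

/* Branch down the implicit tree; no midpoint arithmetic needed. */
static int
eytzinger_search(uint32_t key, const uint32_t *out, size_t n)
{
	size_t		k = 0;

	while (k < n)
	{
		if (out[k] == key)
			return 1;
		k = (key < out[k]) ? 2 * k + 1 : 2 * k + 2;
	}
	return 0;
}
```

As Andres notes, the layout persists across snapshot copies for free, since it is just a permutation of the array plus a "sorted" flag.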
These hash tables are regarded as ephemeral; they only live in
process-local memory and are never rewritten, copied, or
serialized.
What does rewriting refer to here?
I mean that a hash table created for one snapshot will not be cleared and
reused for another.
Not convinced that's the right idea in case of copying. I think we often end
up copying snapshots frequently, and building & allocating the hashed xids
separately every time seems not great.
Right. My concern with reusing the hash tables is that we'd need to
allocate much more space that would go largely unused in many cases.
+ snapshot->xiph = NULL;
+ snapshot->subxiph = NULL;
Do we need separate hashes for these? ISTM that if we're overflowed then we
don't need ->subxiph, and if not, then the action for an xid being in ->xip
and ->subxiph is the same?
Do you mean that we can combine these into one hash table? That might
work.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
xip_search_simd.patch (text/x-diff)
Hi, all
if (!snapshot->suboverflowed)
{
/* we have full data, so search subxip */
- int32 j;
-
- for (j = 0; j < snapshot->subxcnt; j++)
- {
- if (TransactionIdEquals(xid, snapshot->subxip[j]))
- return true;
- }
+ if (XidInXip(xid, snapshot->subxip, snapshot->subxcnt,
+ &snapshot->subxiph))
+ return true;
/* not there, fall through to search xip[] */
}
If snapshot->suboverflowed is false, then subxcnt must be less than PGPROC_MAX_CACHED_SUBXIDS, which is 64 now.
And we won’t use the hash if the xcnt is less than XIP_HASH_MIN_ELEMENTS, which is 128 currently during discussion.
So the subxid hash table will never be used, right?
Regards,
Zhang Mingli
On Jul 14, 2022, at 01:09, Nathan Bossart <nathandbossart@gmail.com> wrote:
Hi hackers,
A few years ago, there was a proposal to create hash tables for long
[sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
I was curious whether this idea still showed measurable benefits, so I
revamped the patch and ran the same test as before [1]. Here are the
results for 60-second runs on an r5d.24xlarge with the data directory on
the local NVMe storage:
writers HEAD patch diff
----------------------------
16 659 664 +1%
32 645 663 +3%
64 659 692 +5%
128 641 716 +12%
256 619 610 -1%
512 530 702 +32%
768 469 582 +24%
1000 367 577 +57%
As before, the hash table approach seems to provide a decent benefit at
higher client counts, so I felt it was worth reviving the idea.
The attached patch has some key differences from the previous proposal.
For example, the new patch uses simplehash instead of open-coding a new
hash table. Also, I've bumped up the threshold for creating hash tables to
128 based on the results of my testing. The attached patch waits until a
lookup of [sub]xip before generating the hash table, so we only need to
allocate enough space for the current elements in the [sub]xip array, and
we avoid allocating extra memory for workloads that do not need the hash
tables. I'm slightly worried about increasing the number of memory
allocations in this code path, but the results above seemed encouraging on
that front.
Thoughts?
[0]: /messages/by-id/35960b8af917e9268881cd8df3f88320@postgrespro.ru
[1]: /messages/by-id/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
<v1-0001-Optimize-lookups-in-snapshot-transactions-in-prog.patch>
On Wed, 13/07/2022 at 10:09 -0700, Nathan Bossart wrote:
Hi hackers,
A few years ago, there was a proposal to create hash tables for long
[sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
I was curious whether this idea still showed measurable benefits, so I
revamped the patch and ran the same test as before [1]. Here are the
results for 60-second runs on an r5d.24xlarge with the data directory on
the local NVMe storage:
writers HEAD patch diff
----------------------------
16 659 664 +1%
32 645 663 +3%
64 659 692 +5%
128 641 716 +12%
256 619 610 -1%
512 530 702 +32%
768 469 582 +24%
1000 367 577 +57%
As before, the hash table approach seems to provide a decent benefit at
higher client counts, so I felt it was worth reviving the idea.
The attached patch has some key differences from the previous proposal.
For example, the new patch uses simplehash instead of open-coding a new
hash table. Also, I've bumped up the threshold for creating hash tables to
128 based on the results of my testing. The attached patch waits until a
lookup of [sub]xip before generating the hash table, so we only need to
allocate enough space for the current elements in the [sub]xip array, and
we avoid allocating extra memory for workloads that do not need the hash
tables. I'm slightly worried about increasing the number of memory
allocations in this code path, but the results above seemed encouraging on
that front.
Thoughts?
[0]: /messages/by-id/35960b8af917e9268881cd8df3f88320@postgrespro.ru
[1]: /messages/by-id/057a9a95-19d2-05f0-17e2-f46ff20e9b3e@2ndquadrant.com
I'm glad my idea has been reborn.
Well, maybe simplehash is not a bad idea,
though it certainly consumes more memory and CPU instructions.
I'll try to review.
regards,
Yura Sokolov
On Sun, Jul 24, 2022 at 12:48:25PM +0800, Zhang Mingli wrote:
If snapshot->suboverflowed is false, then subxcnt must be less than PGPROC_MAX_CACHED_SUBXIDS, which is 64 now.
And we won’t use the hash if the xcnt is less than XIP_HASH_MIN_ELEMENTS, which is 128 currently during discussion.
So the subxid hash table will never be used, right?
This array will store up to TOTAL_MAX_CACHED_SUBXIDS transactions, which
will typically be much greater than 64. When there isn't any overflow,
subxip stores all of the subxids for all of the entries in the procarray.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Got it, thanks.
Regards,
Zhang Mingli
On Jul 25, 2022, at 12:08, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Sun, Jul 24, 2022 at 12:48:25PM +0800, Zhang Mingli wrote:
If snapshot->suboverflowed is false, then subxcnt must be less than PGPROC_MAX_CACHED_SUBXIDS, which is 64 now.
And we won’t use the hash if the xcnt is less than XIP_HASH_MIN_ELEMENTS, which is 128 currently during discussion.
So the subxid hash table will never be used, right?
This array will store up to TOTAL_MAX_CACHED_SUBXIDS transactions, which
will typically be much greater than 64. When there isn't any overflow,
subxip stores all of the subxids for all of the entries in the procarray.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Sat, Jul 16, 2022 at 08:59:57PM -0700, Nathan Bossart wrote:
On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
I wonder if we additionally / alternatively could use a faster method of
searching the array linearly, e.g. using SIMD.
I looked into using SIMD. The patch is attached, but it is only intended
for benchmarking purposes and isn't anywhere close to being worth serious
review. There may be a simpler/better way to implement the linear search,
but this seemed to work well. Overall, SIMD provided a decent improvement.
I had to increase the number of writers quite a bit in order to demonstrate
where the hash tables began winning. Here are the numbers:
writers head simd hash
256 663 632 694
512 530 618 637
768 489 544 573
1024 364 508 562
2048 185 306 485
4096 146 197 441
While it is unsurprising that the hash tables perform the best, there are a
couple of advantages to SIMD that might make that approach worth
considering. For one, there's really no overhead (i.e., you don't need to
sort the array or build a hash table), so we can avoid picking an arbitrary
threshold and just have one code path. Also, a SIMD implementation for a
linear search through an array of integers could likely be easily reused
elsewhere.
From the discussion thus far, it seems there is interest in optimizing
[sub]xip lookups, so I'd like to spend some time moving it forward. I
think the biggest open question is which approach to take. Both the SIMD
and hash table approaches seem viable, but I think I prefer the SIMD
approach at the moment (see the last paragraph of quoted text for the
reasons). What do folks think?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Jul 26, 2022, at 03:04, Nathan Bossart <nathandbossart@gmail.com> wrote:
From the discussion thus far, it seems there is interest in optimizing
[sub]xip lookups, so I'd like to spend some time moving it forward. I
think the biggest open question is which approach to take. Both the SIMD
and hash table approaches seem viable, but I think I prefer the SIMD
approach at the moment (see the last paragraph of quoted text for the
reasons). What do folks think?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
+1, I’m not familiar with SIMD, will try to review this patch.
Regards,
Zhang Mingli
On 2022-07-25 12:04:19 -0700, Nathan Bossart wrote:
On Sat, Jul 16, 2022 at 08:59:57PM -0700, Nathan Bossart wrote:
On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
I wonder if we additionally / alternatively could use a faster method of
searching the array linearly, e.g. using SIMD.
I looked into using SIMD. The patch is attached, but it is only intended
for benchmarking purposes and isn't anywhere close to being worth serious
review. There may be a simpler/better way to implement the linear search,
but this seemed to work well. Overall, SIMD provided a decent improvement.
I had to increase the number of writers quite a bit in order to demonstrate
where the hash tables began winning. Here are the numbers:
writers head simd hash
256 663 632 694
512 530 618 637
768 489 544 573
1024 364 508 562
2048 185 306 485
4096 146 197 441
While it is unsurprising that the hash tables perform the best, there are a
couple of advantages to SIMD that might make that approach worth
considering. For one, there's really no overhead (i.e., you don't need to
sort the array or build a hash table), so we can avoid picking an arbitrary
threshold and just have one code path. Also, a SIMD implementation for a
linear search through an array of integers could likely be easily reused
elsewhere.
From the discussion thus far, it seems there is interest in optimizing
[sub]xip lookups, so I'd like to spend some time moving it forward. I
think the biggest open question is which approach to take. Both the SIMD
and hash table approaches seem viable, but I think I prefer the SIMD
approach at the moment (see the last paragraph of quoted text for the
reasons). What do folks think?
Agreed on all points.
On Tue, Jul 26, 2022 at 11:19:06AM -0700, Andres Freund wrote:
On 2022-07-25 12:04:19 -0700, Nathan Bossart wrote:
From the discussion thus far, it seems there is interest in optimizing
[sub]xip lookups, so I'd like to spend some time moving it forward. I
think the biggest open question is which approach to take. Both the SIMD
and hash table approaches seem viable, but I think I prefer the SIMD
approach at the moment (see the last paragraph of quoted text for the
reasons). What do folks think?
Agreed on all points.
Great! Here is a new patch. A couple notes:
* I briefly looked into seeing whether auto-vectorization was viable and
concluded it was not for these loops.
* I borrowed USE_SSE2 from one of John Naylor's patches [0]. I'm not sure
whether this is committable, so I would welcome thoughts on the proper
form. Given the comment says that SSE2 is supported by all x86-64
hardware, I'm not seeing why we need the SSE 4.2 checks. Is it not
enough to check for __x86_64__ and _M_AMD64?
* I haven't looked into adding an ARM implementation yet.
[0]: /messages/by-id/CAFBsxsHko7yc8A-2PpjQ=2StomXF+T2jgKF=WaMFZWi8CvV7hA@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-Use-SSE2-intrinsics-for-XidInMVCCSnapshot.patch (text/x-diff)
On Fri, Jul 29, 2022 at 4:34 AM Nathan Bossart <nathandbossart@gmail.com>
wrote:
* I briefly looked into seeing whether auto-vectorization was viable and
concluded it was not for these loops.
* I borrowed USE_SSE2 from one of John Naylor's patches [0]. I'm not sure
whether this is committable,
I'll be the first to say it's not committable and needs some thought. Since
there are several recently proposed patches that take advantage of SSE2, it
seems time for me to open a new thread and get that prerequisite settled.
I'll do that next week.
so I would welcome thoughts on the proper
form. Given the comment says that SSE2 is supported by all x86-64
hardware, I'm not seeing why we need the SSE 4.2 checks. Is it not
enough to check for __x86_64__ and _M_AMD64?
That's enough for emitting instructions that the target CPU can run, but
says nothing (I think) about the host compiler's ability to understand the
intrinsics and associated headers. The architecture is old enough that
maybe zero compilers in the buildfarm that target AMD64 fail to understand
SSE2 intrinsics, but I hadn't looked into it. The SSE 4.2 intrinsics check
is not necessary, but it was sufficient and already present, so I borrowed
it for the PoC.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, Jul 30, 2022 at 12:02:02PM +0700, John Naylor wrote:
On Fri, Jul 29, 2022 at 4:34 AM Nathan Bossart <nathandbossart@gmail.com>
wrote:
* I borrowed USE_SSE2 from one of John Naylor's patches [0]. I'm not sure
whether this is committable,
I'll be the first to say it's not committable and needs some thought. Since
there are several recently proposed patches that take advantage of SSE2, it
seems time for me to open a new thread and get that prerequisite settled.
I'll do that next week.
Awesome. I will help test and review.
so I would welcome thoughts on the proper
form. Given the comment says that SSE2 is supported by all x86-64
hardware, I'm not seeing why we need the SSE 4.2 checks. Is it not
enough to check for __x86_64__ and _M_AMD64?
That's enough for emitting instructions that the target CPU can run, but
says nothing (I think) about the host compiler's ability to understand the
intrinsics and associated headers. The architecture is old enough that
maybe zero compilers in the buildfarm that target AMD64 fail to understand
SSE2 intrinsics, but I hadn't looked into it. The SSE 4.2 intrinsics check
is not necessary, but it was sufficient and already present, so I borrowed
it for the PoC.
Got it, makes sense.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Jul 29, 2022 at 10:38:11PM -0700, Nathan Bossart wrote:
On Sat, Jul 30, 2022 at 12:02:02PM +0700, John Naylor wrote:
I'll be the first to say it's not committable and needs some thought. Since
there are several recently proposed patches that take advantage of SSE2, it
seems time for me to open a new thread and get that prerequisite settled.
I'll do that next week.
Awesome. I will help test and review.
While this prerequisite is worked out [0], here is a new patch set. I've
added an 0002 in which I've made use of the proposed SSE2 linear search
function in several other areas. I haven't done any additional performance
analysis, and it's likely I'm missing some eligible code locations, but at
the very least, this demonstrates the reusability of the new function.
[0]: /messages/by-id/CAFBsxsE2G_H_5Wbw+NOPm70-BK4xxKf86-mRzY=L2sLoQqM+-Q@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,
FWIW, I'd split the introduction of the helper and the use of it in snapmgr
into separate patches.
On 2022-08-02 15:13:01 -0700, Nathan Bossart wrote:
diff --git a/src/include/c.h b/src/include/c.h
index d35405f191..2c1a47bc28 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -371,6 +371,14 @@ typedef void (*pg_funcptr_t) (void);
#endif
#endif
+/*
+ * Are SSE2 intrinsics available?
+ */
+#if (defined(__x86_64__) || defined(_M_AMD64))
+#include <emmintrin.h>
+#define USE_SSE2
+#endif
+
It doesn't strike me as a good idea to include this in every single
translation unit in pg. That header (+dependencies) isn't small.
I'm on board with normalizing defines for SSE availability somewhere central
though.
+/*
+ * pg_linearsearch_uint32
+ *
+ * Returns the address of the first element in 'base' that equals 'key', or
+ * NULL if no match is found.
+ */
+#ifdef USE_SSE2
+pg_attribute_no_sanitize_alignment()
+#endif
What's the deal with this annotation? Needs a comment.
+static inline uint32 *
+pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem)
Hm. I suspect this could be a bit faster if we didn't search for the offset,
but just for presence in the array? Most users don't need the offset.
Greetings,
Andres Freund
On Tue, Aug 02, 2022 at 03:55:39PM -0700, Andres Freund wrote:
FWIW, I'd split the introduction of the helper and the use of it in snapmgr
into separate patches.
Will do.
On 2022-08-02 15:13:01 -0700, Nathan Bossart wrote:
diff --git a/src/include/c.h b/src/include/c.h
index d35405f191..2c1a47bc28 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -371,6 +371,14 @@ typedef void (*pg_funcptr_t) (void);
#endif
#endif
+/*
+ * Are SSE2 intrinsics available?
+ */
+#if (defined(__x86_64__) || defined(_M_AMD64))
+#include <emmintrin.h>
+#define USE_SSE2
+#endif
+
It doesn't strike me as a good idea to include this in every single
translation unit in pg. That header (+dependencies) isn't small.
I'm on board with normalizing defines for SSE availability somewhere central
though.
Yeah, this is just a temporary hack for now. It'll go away once the
defines for SSE2 availability are committed.
+/*
+ * pg_linearsearch_uint32
+ *
+ * Returns the address of the first element in 'base' that equals 'key', or
+ * NULL if no match is found.
+ */
+#ifdef USE_SSE2
+pg_attribute_no_sanitize_alignment()
+#endif
What's the deal with this annotation? Needs a comment.
Will do. c.h suggests that this should only be used for x86-specific code.
+static inline uint32 *
+pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem)
Hm. I suspect this could be a bit faster if we didn't search for the offset,
but just for presence in the array? Most users don't need the offset.
Just under half of the callers in 0002 require the offset, but I don't know
if any of those are worth optimizing in the first place. I'll change it
for now. It's easy enough to add it back in the future if required.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 3, 2022 at 6:43 AM Nathan Bossart <nathandbossart@gmail.com>
wrote:
Just under half of the callers in 0002 require the offset, but I don't
know
if any of those are worth optimizing in the first place. I'll change it
for now. It's easy enough to add it back in the future if required.
Yeah, some of those callers will rarely have more than several elements to
search in the first place, or aren't performance-sensitive.
--
John Naylor
EDB: http://www.enterprisedb.com
Here is a new patch set. 0001 is the currently-proposed patch from the
other thread [0] for determining SSE2 support. 0002 introduces the
optimized linear search function. And 0003 makes use of the new function
for the [sub]xip lookups in XidInMVCCSnapshot().
[0]: /messages/by-id/CAFBsxsGktHL7=JXbgnKTi_uL0VRPcH4FSAqc6yK-3+JYfqPPjA@mail.gmail.com
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com