optimize several list functions with SIMD intrinsics
I noticed that several of the List functions do simple linear searches that
can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in
37a6e5d). The following table shows the time spent iterating over a list
of n elements (via list_member_int) one billion times on my x86 laptop.
n | head (ms) | patched (ms)
------+-----------+--------------
2 | 3884 | 3001
4 | 5506 | 4092
8 | 6209 | 3026
16 | 8797 | 4458
32 | 25051 | 7032
64 | 37611 | 12763
128 | 61886 | 22770
256 | 111170 | 59885
512 | 209612 | 103378
1024 | 407462 | 189484
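For context, a driver along these lines reproduces the measurement (a
simplified sketch of mine, not the exact harness; the probe value and the
use of instr_time are assumptions):

    /* time repeated probes of an n-element integer List */
    List       *list = NIL;
    instr_time  start,
                duration;

    for (int i = 0; i < n; i++)
        list = lappend_int(list, i);

    INSTR_TIME_SET_CURRENT(start);
    for (int64 j = 0; j < INT64CONST(1000000000); j++)
        (void) list_member_int(list, -1);   /* never matches: full scan */
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    elog(NOTICE, "n=%d: %.0f ms", n, INSTR_TIME_GET_MILLISEC(duration));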
I've attached a work-in-progress patch that implements these optimizations
for both x86 and arm, and I will register this in the July commitfest. I'm
posting this a little early in order to gauge interest. Presumably you
shouldn't be using a List if you have many elements that must be routinely
searched, but it might be nice to speed up these functions, anyway. This
was mostly a fun weekend project, and I don't presently have any concrete
examples of workloads where this might help.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-speed-up-several-list-functions-with-SIMD.patch (text/x-diff; charset=us-ascii)
From 4e04f84766d98f9ba6bb6fdd03bcb431c8aad1d3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 4 Mar 2023 23:16:07 -0800
Subject: [PATCH v1 1/1] speed up several list functions with SIMD
---
src/backend/nodes/list.c | 254 +++++++++++++++++++++++++++++++-----
src/include/port/pg_lfind.h | 189 +++++++++++++++++++++++++++
src/include/port/simd.h | 103 +++++++++++++++
3 files changed, 513 insertions(+), 33 deletions(-)
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index a709d23ef1..acc56dddb7 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -19,6 +19,7 @@
#include "nodes/pg_list.h"
#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
#include "utils/memdebug.h"
#include "utils/memutils.h"
@@ -680,11 +681,25 @@ list_member(const List *list, const void *datum)
bool
list_member_ptr(const List *list, const void *datum)
{
+#ifdef USE_NO_SIMD
const ListCell *cell;
+#endif
Assert(IsPointerList(list));
check_list_invariants(list);
+#ifndef USE_NO_SIMD
+
+ Assert(sizeof(ListCell) == 8);
+ Assert(sizeof(void *) == 8);
+
+ if (list == NIL)
+ return false;
+
+ return pg_lfind64((uint64) datum, (uint64 *) list->elements, list->length);
+
+#else
+
foreach(cell, list)
{
if (lfirst(cell) == datum)
@@ -692,46 +707,121 @@ list_member_ptr(const List *list, const void *datum)
}
return false;
+
+#endif
}
/*
- * Return true iff the integer 'datum' is a member of the list.
+ * Optimized linear search routine (using SIMD intrinsics where available) for
+ * lists with inline 32-bit data.
*/
-bool
-list_member_int(const List *list, int datum)
+static inline bool
+list_member_inline_internal(const List *list, uint32 datum)
{
+ uint32 i = 0;
const ListCell *cell;
- Assert(IsIntegerList(list));
- check_list_invariants(list);
+#ifndef USE_NO_SIMD
- foreach(cell, list)
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector32 keys = vector32_broadcast(datum); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+ const Vector32 mask = (Vector32) vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+ const uint32 *elements = (const uint32 *) list->elements;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = (list->length * 2) & ~(nelem_per_iteration - 1);
+
+ /*
+ * The SIMD-optimized portion of this routine is written with the
+ * expectation that the 32-bit datum we are searching for only takes up
+ * half of a ListCell. If that changes, this routine must change, too.
+ */
+ Assert(sizeof(ListCell) == 8);
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector32 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result,
+ masked;
+
+ /* load the next block into 4 registers */
+ vector32_load(&vals1, &elements[i]);
+ vector32_load(&vals2, &elements[i + nelem_per_vector]);
+ vector32_load(&vals3, &elements[i + nelem_per_vector * 2]);
+ vector32_load(&vals4, &elements[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector32_eq(keys, vals1);
+ result2 = vector32_eq(keys, vals2);
+ result3 = vector32_eq(keys, vals3);
+ result4 = vector32_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector32_or(result1, result2);
+ tmp2 = vector32_or(result3, result4);
+ result = vector32_or(tmp1, tmp2);
+
+ /* filter out matches in space between data */
+ masked = vector32_and(result, mask);
+
+ /* see if there was a match */
+ if (vector32_is_highbit_set(masked))
+ return true;
+ }
+
+#endif /* ! USE_NO_SIMD */
+
+ for_each_from(cell, list, i / 2)
{
- if (lfirst_int(cell) == datum)
+ if (lfirst_int(cell) == (int) datum)
return true;
}
return false;
}
+/*
+ * Return true iff the integer 'datum' is a member of the list.
+ */
+bool
+list_member_int(const List *list, int datum)
+{
+ Assert(IsIntegerList(list));
+ check_list_invariants(list);
+
+ if (list == NIL)
+ return false;
+
+ return list_member_inline_internal(list, datum);
+}
+
/*
* Return true iff the OID 'datum' is a member of the list.
*/
bool
list_member_oid(const List *list, Oid datum)
{
- const ListCell *cell;
-
Assert(IsOidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_oid(cell) == datum)
- return true;
- }
+ if (list == NIL)
+ return false;
- return false;
+ return list_member_inline_internal(list, datum);
}
/*
@@ -740,18 +830,13 @@ list_member_oid(const List *list, Oid datum)
bool
list_member_xid(const List *list, TransactionId datum)
{
- const ListCell *cell;
-
Assert(IsXidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_xid(cell) == datum)
- return true;
- }
+ if (list == NIL)
+ return false;
- return false;
+ return list_member_inline_internal(list, datum);
}
/*
@@ -875,16 +960,117 @@ list_delete_ptr(List *list, void *datum)
Assert(IsPointerList(list));
check_list_invariants(list);
+#ifndef USE_NO_SIMD
+
+ Assert(sizeof(ListCell) == 8);
+ Assert(sizeof(void *) == 8);
+
+ if (list == NIL)
+ return NIL;
+
+ cell = (ListCell *) pg_lfind64_idx((uint64) datum,
+ (uint64 *) list->elements,
+ list->length);
+ if (cell != NULL)
+ return list_delete_cell(list, cell);
+
+#else
+
foreach(cell, list)
{
if (lfirst(cell) == datum)
return list_delete_cell(list, cell);
}
+#endif
+
/* Didn't find a match: return the list unmodified */
return list;
}
+/*
+ * Optimized linear search routine (using SIMD intrinsics where available) for
+ * lists with inline 32-bit data.
+ */
+static inline ListCell *
+list_member_inline_interal_idx(const List *list, uint32 datum)
+{
+ uint32 i = 0;
+ ListCell *cell;
+
+#ifndef USE_NO_SIMD
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector32 keys = vector32_broadcast(datum); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+ const Vector32 mask = (Vector32) vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+ const uint32 *elements = (const uint32 *) list->elements;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = (list->length * 2) & ~(nelem_per_iteration - 1);
+
+ /*
+ * The SIMD-optimized portion of this routine is written with the
+ * expectation that the 32-bit datum we are searching for only takes up
+ * half of a ListCell. If that changes, this routine must change, too.
+ */
+ Assert(sizeof(ListCell) == 8);
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector32 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result,
+ masked;
+
+ /* load the next block into 4 registers */
+ vector32_load(&vals1, &elements[i]);
+ vector32_load(&vals2, &elements[i + nelem_per_vector]);
+ vector32_load(&vals3, &elements[i + nelem_per_vector * 2]);
+ vector32_load(&vals4, &elements[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector32_eq(keys, vals1);
+ result2 = vector32_eq(keys, vals2);
+ result3 = vector32_eq(keys, vals3);
+ result4 = vector32_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector32_or(result1, result2);
+ tmp2 = vector32_or(result3, result4);
+ result = vector32_or(tmp1, tmp2);
+
+ /* filter out matches in space between data */
+ masked = vector32_and(result, mask);
+
+ /* break out and find the exact element if there was a match */
+ if (vector32_is_highbit_set(masked))
+ break;
+ }
+
+#endif /* ! USE_NO_SIMD */
+
+ for_each_from(cell, list, i / 2)
+ {
+ if (lfirst_int(cell) == (int) datum)
+ return cell;
+ }
+
+ return NULL;
+}
+
/* As above, but for integers */
List *
list_delete_int(List *list, int datum)
@@ -894,11 +1080,12 @@ list_delete_int(List *list, int datum)
Assert(IsIntegerList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_int(cell) == datum)
- return list_delete_cell(list, cell);
- }
+ if (list == NIL)
+ return NIL;
+
+ cell = list_member_inline_interal_idx(list, datum);
+ if (cell)
+ return list_delete_cell(list, cell);
/* Didn't find a match: return the list unmodified */
return list;
@@ -913,11 +1100,12 @@ list_delete_oid(List *list, Oid datum)
Assert(IsOidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_oid(cell) == datum)
- return list_delete_cell(list, cell);
- }
+ if (list == NIL)
+ return NIL;
+
+ cell = list_member_inline_interal_idx(list, datum);
+ if (cell)
+ return list_delete_cell(list, cell);
/* Didn't find a match: return the list unmodified */
return list;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 59aa8245ed..05d0b12463 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -177,4 +177,193 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
return false;
}
+/*
+ * pg_lfind64
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind64(uint64 key, uint64 *base, uint32 nelem)
+{
+ uint32 i = 0;
+
+#ifndef USE_NO_SIMD
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector64 keys = vector64_broadcast(key); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector64) / sizeof(uint64);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+
+#if defined(USE_ASSERT_CHECKING)
+ bool assert_result = false;
+
+ /* pre-compute the result for assert checking */
+ for (i = 0; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector64 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result;
+
+ /* load the next block into 4 registers */
+ vector64_load(&vals1, &base[i]);
+ vector64_load(&vals2, &base[i + nelem_per_vector]);
+ vector64_load(&vals3, &base[i + nelem_per_vector * 2]);
+ vector64_load(&vals4, &base[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector64_eq(keys, vals1);
+ result2 = vector64_eq(keys, vals2);
+ result3 = vector64_eq(keys, vals3);
+ result4 = vector64_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector64_or(result1, result2);
+ tmp2 = vector64_or(result3, result4);
+ result = vector64_or(tmp1, tmp2);
+
+ /* see if there was a match */
+ if (vector64_is_highbit_set(result))
+ {
+ Assert(assert_result == true);
+ return true;
+ }
+ }
+#endif /* ! USE_NO_SIMD */
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+#ifndef USE_NO_SIMD
+ Assert(assert_result == true);
+#endif
+ return true;
+ }
+ }
+
+#ifndef USE_NO_SIMD
+ Assert(assert_result == false);
+#endif
+ return false;
+}
+
+/*
+ * pg_lfind64_idx
+ *
+ * Return the address of the first element in 'base' that equals 'key', or NULL
+ * if no match is found.
+ */
+static inline uint64 *
+pg_lfind64_idx(uint64 key, uint64 *base, uint32 nelem)
+{
+ uint32 i = 0;
+
+#ifndef USE_NO_SIMD
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector64 keys = vector64_broadcast(key); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector64) / sizeof(uint64);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+
+#if defined(USE_ASSERT_CHECKING)
+ uint64 *assert_result = NULL;
+
+ /* pre-compute the result for assert checking */
+ for (i = 0; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+ assert_result = &base[i];
+ break;
+ }
+ }
+#endif
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector64 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result;
+
+ /* load the next block into 4 registers */
+ vector64_load(&vals1, &base[i]);
+ vector64_load(&vals2, &base[i + nelem_per_vector]);
+ vector64_load(&vals3, &base[i + nelem_per_vector * 2]);
+ vector64_load(&vals4, &base[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector64_eq(keys, vals1);
+ result2 = vector64_eq(keys, vals2);
+ result3 = vector64_eq(keys, vals3);
+ result4 = vector64_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector64_or(result1, result2);
+ tmp2 = vector64_or(result3, result4);
+ result = vector64_or(tmp1, tmp2);
+
+ /* if there was a match, break out to find the matching element */
+ if (vector64_is_highbit_set(result))
+ break;
+ }
+#endif /* ! USE_NO_SIMD */
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+#ifndef USE_NO_SIMD
+ Assert(assert_result == &base[i]);
+#endif
+ return &base[i];
+ }
+ }
+
+#ifndef USE_NO_SIMD
+ Assert(assert_result == NULL);
+#endif
+ return NULL;
+}
+
#endif /* PG_LFIND_H */
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..805b99cd61 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -32,6 +32,7 @@
#define USE_SSE2
typedef __m128i Vector8;
typedef __m128i Vector32;
+typedef __m128i Vector64;
#elif defined(__aarch64__) && defined(__ARM_NEON)
/*
@@ -46,6 +47,7 @@ typedef __m128i Vector32;
#define USE_NEON
typedef uint8x16_t Vector8;
typedef uint32x4_t Vector32;
+typedef uint64x2_t Vector64;
#else
/*
@@ -64,12 +66,14 @@ typedef uint64 Vector8;
static inline void vector8_load(Vector8 *v, const uint8 *s);
#ifndef USE_NO_SIMD
static inline void vector32_load(Vector32 *v, const uint32 *s);
+static inline void vector64_load(Vector64 *v, const uint64 *s);
#endif
/* assignment operations */
static inline Vector8 vector8_broadcast(const uint8 c);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_broadcast(const uint32 c);
+static inline Vector64 vector64_broadcast(const uint64 c);
#endif
/* element-wise comparisons to a scalar */
@@ -79,12 +83,16 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline bool vector64_is_highbit_set(const Vector64 v);
#endif
/* arithmetic operations */
static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_or(const Vector64 v1, const Vector64 v2);
+static inline Vector32 vector32_and(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_and(const Vector64 v1, const Vector64 v2);
static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#endif
@@ -97,6 +105,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_eq(const Vector64 v1, const Vector64 v2);
#endif
/*
@@ -126,6 +135,18 @@ vector32_load(Vector32 *v, const uint32 *s)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline void
+vector64_load(Vector64 *v, const uint64 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#elif defined(USE_NEON)
+ *v = vld1q_u64((const uint64_t *) s);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Create a vector with all elements set to the same value.
*/
@@ -153,6 +174,18 @@ vector32_broadcast(const uint32 c)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_broadcast(const uint64 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi64x(c);
+#elif defined(USE_NEON)
+ return vdupq_n_u64(c);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return true if any elements in the vector are equal to the given scalar.
*/
@@ -299,6 +332,23 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Exactly like vector8_is_highbit_set except for the input type, so it
+ * looks at each byte separately. See the comment above
+ * vector32_is_highbit_set for more information.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector64_is_highbit_set(const Vector64 v)
+{
+#if defined(USE_NEON)
+ return vector8_is_highbit_set((Vector8) v);
+#else
+ return vector8_is_highbit_set(v);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -326,6 +376,42 @@ vector32_or(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_or(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ return _mm_or_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vorrq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_and(const Vector32 v1, const Vector32 v2)
+{
+#ifdef USE_SSE2
+ return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vandq_u32(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_and(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vandq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the result of subtracting the respective elements of the input
* vectors using saturation (i.e., if the operation would yield a value less
@@ -372,4 +458,21 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_eq(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ /* We have to work around SSE2's lack of _mm_cmpeq_epi64. */
+ const Vector32 cmp = vector32_eq(v1, v2);
+ const Vector32 hi = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(3, 3, 1, 1));
+ const Vector32 lo = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(2, 2, 0, 0));
+
+ return vector32_and(hi, lo);
+#elif defined(USE_NEON)
+ return vceqq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.25.1
On Wed, 8 Mar 2023 at 13:25, Nathan Bossart <nathandbossart@gmail.com> wrote:
> I've attached a work-in-progress patch that implements these optimizations
> for both x86 and arm, and I will register this in the July commitfest. I'm
> posting this a little early in order to gauge interest.
Interesting and quite impressive performance numbers.
From having a quick glance at the patch, it looks like you'll need to
take some extra time to make it work on 32-bit builds.
David
On Wed, Mar 08, 2023 at 01:54:15PM +1300, David Rowley wrote:
> Interesting and quite impressive performance numbers.
Thanks for taking a look.
> From having a quick glance at the patch, it looks like you'll need to
> take some extra time to make it work on 32-bit builds.
At the moment, support for SIMD intrinsics in Postgres is limited to 64-bit
builds (simd.h has the details). But yes, if we want to make this work for
32-bit builds, additional work is required.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
cfbot's Windows build wasn't happy with a couple of casts. I applied a
fix similar to c6a43c2 in v2. The patch is still very much a work in
progress.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v2-0001-speed-up-several-list-functions-with-SIMD.patch (text/x-diff; charset=us-ascii)
From 055717233c47518ae48119938ebd203cc55f7f3c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 4 Mar 2023 23:16:07 -0800
Subject: [PATCH v2 1/1] speed up several list functions with SIMD
---
src/backend/nodes/list.c | 262 +++++++++++++++++++++++++++++++-----
src/include/port/pg_lfind.h | 189 ++++++++++++++++++++++++++
src/include/port/simd.h | 103 ++++++++++++++
3 files changed, 521 insertions(+), 33 deletions(-)
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index a709d23ef1..02ddbeb3f2 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -19,6 +19,7 @@
#include "nodes/pg_list.h"
#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
#include "utils/memdebug.h"
#include "utils/memutils.h"
@@ -680,11 +681,25 @@ list_member(const List *list, const void *datum)
bool
list_member_ptr(const List *list, const void *datum)
{
+#ifdef USE_NO_SIMD
const ListCell *cell;
+#endif
Assert(IsPointerList(list));
check_list_invariants(list);
+#ifndef USE_NO_SIMD
+
+ Assert(sizeof(ListCell) == 8);
+ Assert(sizeof(void *) == 8);
+
+ if (list == NIL)
+ return false;
+
+ return pg_lfind64((uint64) datum, (uint64 *) list->elements, list->length);
+
+#else
+
foreach(cell, list)
{
if (lfirst(cell) == datum)
@@ -692,46 +707,125 @@ list_member_ptr(const List *list, const void *datum)
}
return false;
+
+#endif
}
/*
- * Return true iff the integer 'datum' is a member of the list.
+ * Optimized linear search routine (using SIMD intrinsics where available) for
+ * lists with inline 32-bit data.
*/
-bool
-list_member_int(const List *list, int datum)
+static inline bool
+list_member_inline_internal(const List *list, uint32 datum)
{
+ uint32 i = 0;
const ListCell *cell;
- Assert(IsIntegerList(list));
- check_list_invariants(list);
+#ifndef USE_NO_SIMD
- foreach(cell, list)
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector32 keys = vector32_broadcast(datum); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+#ifdef USE_NEON
+ const Vector32 mask = (Vector32) vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+#else
+ const Vector32 mask = vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+#endif
+ const uint32 *elements = (const uint32 *) list->elements;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = (list->length * 2) & ~(nelem_per_iteration - 1);
+
+ /*
+ * The SIMD-optimized portion of this routine is written with the
+ * expectation that the 32-bit datum we are searching for only takes up
+ * half of a ListCell. If that changes, this routine must change, too.
+ */
+ Assert(sizeof(ListCell) == 8);
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector32 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result,
+ masked;
+
+ /* load the next block into 4 registers */
+ vector32_load(&vals1, &elements[i]);
+ vector32_load(&vals2, &elements[i + nelem_per_vector]);
+ vector32_load(&vals3, &elements[i + nelem_per_vector * 2]);
+ vector32_load(&vals4, &elements[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector32_eq(keys, vals1);
+ result2 = vector32_eq(keys, vals2);
+ result3 = vector32_eq(keys, vals3);
+ result4 = vector32_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector32_or(result1, result2);
+ tmp2 = vector32_or(result3, result4);
+ result = vector32_or(tmp1, tmp2);
+
+ /* filter out matches in space between data */
+ masked = vector32_and(result, mask);
+
+ /* see if there was a match */
+ if (vector32_is_highbit_set(masked))
+ return true;
+ }
+
+#endif /* ! USE_NO_SIMD */
+
+ for_each_from(cell, list, i / 2)
{
- if (lfirst_int(cell) == datum)
+ if (lfirst_int(cell) == (int) datum)
return true;
}
return false;
}
+/*
+ * Return true iff the integer 'datum' is a member of the list.
+ */
+bool
+list_member_int(const List *list, int datum)
+{
+ Assert(IsIntegerList(list));
+ check_list_invariants(list);
+
+ if (list == NIL)
+ return false;
+
+ return list_member_inline_internal(list, datum);
+}
+
/*
* Return true iff the OID 'datum' is a member of the list.
*/
bool
list_member_oid(const List *list, Oid datum)
{
- const ListCell *cell;
-
Assert(IsOidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_oid(cell) == datum)
- return true;
- }
+ if (list == NIL)
+ return false;
- return false;
+ return list_member_inline_internal(list, datum);
}
/*
@@ -740,18 +834,13 @@ list_member_oid(const List *list, Oid datum)
bool
list_member_xid(const List *list, TransactionId datum)
{
- const ListCell *cell;
-
Assert(IsXidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_xid(cell) == datum)
- return true;
- }
+ if (list == NIL)
+ return false;
- return false;
+ return list_member_inline_internal(list, datum);
}
/*
@@ -875,16 +964,121 @@ list_delete_ptr(List *list, void *datum)
Assert(IsPointerList(list));
check_list_invariants(list);
+#ifndef USE_NO_SIMD
+
+ Assert(sizeof(ListCell) == 8);
+ Assert(sizeof(void *) == 8);
+
+ if (list == NIL)
+ return NIL;
+
+ cell = (ListCell *) pg_lfind64_idx((uint64) datum,
+ (uint64 *) list->elements,
+ list->length);
+ if (cell != NULL)
+ return list_delete_cell(list, cell);
+
+#else
+
foreach(cell, list)
{
if (lfirst(cell) == datum)
return list_delete_cell(list, cell);
}
+#endif
+
/* Didn't find a match: return the list unmodified */
return list;
}
+/*
+ * Optimized linear search routine (using SIMD intrinsics where available) for
+ * lists with inline 32-bit data.
+ */
+static inline ListCell *
+list_member_inline_interal_idx(const List *list, uint32 datum)
+{
+ uint32 i = 0;
+ ListCell *cell;
+
+#ifndef USE_NO_SIMD
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector32 keys = vector32_broadcast(datum); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+#ifdef USE_NEON
+ const Vector32 mask = (Vector32) vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+#else
+ const Vector32 mask = vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+#endif
+ const uint32 *elements = (const uint32 *) list->elements;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = (list->length * 2) & ~(nelem_per_iteration - 1);
+
+ /*
+ * The SIMD-optimized portion of this routine is written with the
+ * expectation that the 32-bit datum we are searching for only takes up
+ * half of a ListCell. If that changes, this routine must change, too.
+ */
+ Assert(sizeof(ListCell) == 8);
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector32 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result,
+ masked;
+
+ /* load the next block into 4 registers */
+ vector32_load(&vals1, &elements[i]);
+ vector32_load(&vals2, &elements[i + nelem_per_vector]);
+ vector32_load(&vals3, &elements[i + nelem_per_vector * 2]);
+ vector32_load(&vals4, &elements[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector32_eq(keys, vals1);
+ result2 = vector32_eq(keys, vals2);
+ result3 = vector32_eq(keys, vals3);
+ result4 = vector32_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector32_or(result1, result2);
+ tmp2 = vector32_or(result3, result4);
+ result = vector32_or(tmp1, tmp2);
+
+ /* filter out matches in space between data */
+ masked = vector32_and(result, mask);
+
+ /* break out and find the exact element if there was a match */
+ if (vector32_is_highbit_set(masked))
+ break;
+ }
+
+#endif /* ! USE_NO_SIMD */
+
+ for_each_from(cell, list, i / 2)
+ {
+ if (lfirst_int(cell) == (int) datum)
+ return cell;
+ }
+
+ return NULL;
+}
+
/* As above, but for integers */
List *
list_delete_int(List *list, int datum)
@@ -894,11 +1088,12 @@ list_delete_int(List *list, int datum)
Assert(IsIntegerList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_int(cell) == datum)
- return list_delete_cell(list, cell);
- }
+ if (list == NIL)
+ return NIL;
+
+ cell = list_member_inline_interal_idx(list, datum);
+ if (cell)
+ return list_delete_cell(list, cell);
/* Didn't find a match: return the list unmodified */
return list;
@@ -913,11 +1108,12 @@ list_delete_oid(List *list, Oid datum)
Assert(IsOidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_oid(cell) == datum)
- return list_delete_cell(list, cell);
- }
+ if (list == NIL)
+ return NIL;
+
+ cell = list_member_inline_interal_idx(list, datum);
+ if (cell)
+ return list_delete_cell(list, cell);
/* Didn't find a match: return the list unmodified */
return list;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 59aa8245ed..05d0b12463 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -177,4 +177,193 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
return false;
}
+/*
+ * pg_lfind64
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind64(uint64 key, uint64 *base, uint32 nelem)
+{
+ uint32 i = 0;
+
+#ifndef USE_NO_SIMD
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector64 keys = vector64_broadcast(key); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector64) / sizeof(uint64);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+
+#if defined(USE_ASSERT_CHECKING)
+ bool assert_result = false;
+
+ /* pre-compute the result for assert checking */
+ for (i = 0; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector64 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result;
+
+ /* load the next block into 4 registers */
+ vector64_load(&vals1, &base[i]);
+ vector64_load(&vals2, &base[i + nelem_per_vector]);
+ vector64_load(&vals3, &base[i + nelem_per_vector * 2]);
+ vector64_load(&vals4, &base[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector64_eq(keys, vals1);
+ result2 = vector64_eq(keys, vals2);
+ result3 = vector64_eq(keys, vals3);
+ result4 = vector64_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector64_or(result1, result2);
+ tmp2 = vector64_or(result3, result4);
+ result = vector64_or(tmp1, tmp2);
+
+ /* see if there was a match */
+ if (vector64_is_highbit_set(result))
+ {
+ Assert(assert_result == true);
+ return true;
+ }
+ }
+#endif /* ! USE_NO_SIMD */
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+#ifndef USE_NO_SIMD
+ Assert(assert_result == true);
+#endif
+ return true;
+ }
+ }
+
+#ifndef USE_NO_SIMD
+ Assert(assert_result == false);
+#endif
+ return false;
+}
+
+/*
+ * pg_lfind64_idx
+ *
+ * Return the address of the first element in 'base' that equals 'key', or NULL
+ * if no match is found.
+ */
+static inline uint64 *
+pg_lfind64_idx(uint64 key, uint64 *base, uint32 nelem)
+{
+ uint32 i = 0;
+
+#ifndef USE_NO_SIMD
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector64 keys = vector64_broadcast(key); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector64) / sizeof(uint64);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+
+#if defined(USE_ASSERT_CHECKING)
+ uint64 *assert_result = NULL;
+
+ /* pre-compute the result for assert checking */
+ for (i = 0; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+ assert_result = &base[i];
+ break;
+ }
+ }
+#endif
+
+ for (i = 0; i < tail_idx; i += nelem_per_iteration)
+ {
+ Vector64 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result;
+
+ /* load the next block into 4 registers */
+ vector64_load(&vals1, &base[i]);
+ vector64_load(&vals2, &base[i + nelem_per_vector]);
+ vector64_load(&vals3, &base[i + nelem_per_vector * 2]);
+ vector64_load(&vals4, &base[i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector64_eq(keys, vals1);
+ result2 = vector64_eq(keys, vals2);
+ result3 = vector64_eq(keys, vals3);
+ result4 = vector64_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector64_or(result1, result2);
+ tmp2 = vector64_or(result3, result4);
+ result = vector64_or(tmp1, tmp2);
+
+ /* if there was a match, break out to find the matching element */
+ if (vector64_is_highbit_set(result))
+ break;
+ }
+#endif /* ! USE_NO_SIMD */
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ {
+#ifndef USE_NO_SIMD
+ Assert(assert_result == &base[i]);
+#endif
+ return &base[i];
+ }
+ }
+
+#ifndef USE_NO_SIMD
+ Assert(assert_result == NULL);
+#endif
+ return NULL;
+}
+
#endif /* PG_LFIND_H */
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..805b99cd61 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -32,6 +32,7 @@
#define USE_SSE2
typedef __m128i Vector8;
typedef __m128i Vector32;
+typedef __m128i Vector64;
#elif defined(__aarch64__) && defined(__ARM_NEON)
/*
@@ -46,6 +47,7 @@ typedef __m128i Vector32;
#define USE_NEON
typedef uint8x16_t Vector8;
typedef uint32x4_t Vector32;
+typedef uint64x2_t Vector64;
#else
/*
@@ -64,12 +66,14 @@ typedef uint64 Vector8;
static inline void vector8_load(Vector8 *v, const uint8 *s);
#ifndef USE_NO_SIMD
static inline void vector32_load(Vector32 *v, const uint32 *s);
+static inline void vector64_load(Vector64 *v, const uint64 *s);
#endif
/* assignment operations */
static inline Vector8 vector8_broadcast(const uint8 c);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_broadcast(const uint32 c);
+static inline Vector64 vector64_broadcast(const uint64 c);
#endif
/* element-wise comparisons to a scalar */
@@ -79,12 +83,16 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline bool vector64_is_highbit_set(const Vector64 v);
#endif
/* arithmetic operations */
static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_or(const Vector64 v1, const Vector64 v2);
+static inline Vector32 vector32_and(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_and(const Vector64 v1, const Vector64 v2);
static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#endif
@@ -97,6 +105,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_eq(const Vector64 v1, const Vector64 v2);
#endif
/*
@@ -126,6 +135,18 @@ vector32_load(Vector32 *v, const uint32 *s)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline void
+vector64_load(Vector64 *v, const uint64 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#elif defined(USE_NEON)
+ *v = vld1q_u64((const uint64_t *) s);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Create a vector with all elements set to the same value.
*/
@@ -153,6 +174,18 @@ vector32_broadcast(const uint32 c)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_broadcast(const uint64 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi64x(c);
+#elif defined(USE_NEON)
+ return vdupq_n_u64(c);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return true if any elements in the vector are equal to the given scalar.
*/
@@ -299,6 +332,23 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Exactly like vector8_is_highbit_set except for the input type, so it
+ * looks at each byte separately. See the comment above
+ * vector32_is_highbit_set for more information.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector64_is_highbit_set(const Vector64 v)
+{
+#if defined(USE_NEON)
+ return vector8_is_highbit_set((Vector8) v);
+#else
+ return vector8_is_highbit_set(v);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -326,6 +376,42 @@ vector32_or(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_or(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ return _mm_or_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vorrq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_and(const Vector32 v1, const Vector32 v2)
+{
+#ifdef USE_SSE2
+ return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vandq_u32(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_and(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vandq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the result of subtracting the respective elements of the input
* vectors using saturation (i.e., if the operation would yield a value less
@@ -372,4 +458,21 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_eq(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ /* We have to work around SSE2's lack of _mm_cmpeq_epi64. */
+ const Vector32 cmp = vector32_eq(v1, v2);
+ const Vector32 hi = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(3, 3, 1, 1));
+ const Vector32 lo = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(2, 2, 0, 0));
+
+ return vector32_and(hi, lo);
+#elif defined(USE_NEON)
+ return vceqq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.25.1
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: not tested
Spec compliant: not tested
Documentation: not tested
Hello,
Adding some review comments:
1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from
#ifdef USE_NO_SIMD
const ListCell *cell;
#endif
to #else, as mentioned below? This will make the visual separation between the #if cases cleaner:
#else
const ListCell *cell;
foreach(cell, list)
{
if (lfirst(cell) == datum)
return true;
}
return false;
#endif
2. Lots of duplicated
if (list == NIL) checks before calling list_member_inline_internal(list, datum)
Can we not add this check in list_member_inline_internal itself?
3. if (cell)
return list_delete_cell(list, cell) in list_delete_int/oid
can we change if (cell) to if (cell != NULL) as used elsewhere?
4. list_member_inline_interal_idx: there is a typo in the name.
5. list_member_inline_interal_idx and list_member_inline_internal are practically the same, with almost 90+% duplication.
Can we not refactor both into one and allow either the cell or true/false as the return? Although I see usage of const ListCell, so this might be problematic.
6. Loop for (i = 0; i < tail_idx; i += nelem_per_iteration) for Vector32 in list.c looks duplicated from pg_lfind32 (in pg_lfind.h), can we not reuse that?
7. Is it possible to add a benchmark which shows improvement against HEAD?
Regards,
Ankit
The new status of this patch is: Waiting on Author
> 7. Is it possible to add a benchmark which shows improvement against
> HEAD?
Please ignore this from my earlier mail; I just saw the stats now.
Thanks,
Ankit
Thanks for taking a look.
On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote:
> 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from
> #ifdef USE_NO_SIMD
> const ListCell *cell;
> #endif
> to #else, as mentioned below? This will make the visual separation between the #if cases cleaner
I would expect to see -Wdeclaration-after-statement warnings if we did
this.
> 2. Lots of duplicated
> if (list == NIL) checks before calling list_member_inline_internal(list, datum)
> Can we not add this check in list_member_inline_internal itself?
We probably could. I only extracted the NIL checks to simplify the code in
list_member_inline_internal() a bit.
> 3. if (cell)
> return list_delete_cell(list, cell) in list_delete_int/oid
> can we change if (cell) to if (cell != NULL) as used elsewhere?
Sure.
> 4. list_member_inline_interal_idx: there is a typo in the name.
Will fix.
> 5. list_member_inline_interal_idx and list_member_inline_internal are practically the same, with almost 90+% duplication.
> Can we not refactor both into one and allow either the cell or true/false as the return? Although I see usage of const ListCell, so this might be problematic.
The idea was to skip finding the exact match if all we care about is
whether the element exists. This micro-optimization might be negligible,
in which case we could use list_member_inline_internal_idx() for both
cases.
> 6. Loop for (i = 0; i < tail_idx; i += nelem_per_iteration) for Vector32 in list.c looks duplicated from pg_lfind32 (in pg_lfind.h), can we not reuse that?
The list.c version is slightly different because we need to disregard any
matches that we find in between the data. For example, in an integer List,
the integer will take up 4 bytes of the 8-byte ListCell (for 64-bit
platforms).
typedef union ListCell
{
void *ptr_value;
int int_value;
Oid oid_value;
TransactionId xid_value;
} ListCell;
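To illustrate with a scalar model (a sketch for explanation only, not code
from the patch): on a little-endian 64-bit build, the uint32 lanes of an
integer List alternate between real data and the unused upper half of each
cell, so the comparison result is ANDed with
vector64_broadcast(UINT64CONST(0xFFFFFFFF)) to throw away any match that
lands in a padding lane.

    /* scalar model of the masked lane check (assumes little-endian) */
    static inline bool
    cell_matches(uint64 cell, uint32 key)
    {
        uint64      eq = 0;

        /* vector32_eq: each matching 32-bit lane becomes all-ones */
        if ((uint32) cell == key)
            eq |= UINT64CONST(0x00000000FFFFFFFF);  /* data lane */
        if ((uint32) (cell >> 32) == key)
            eq |= UINT64CONST(0xFFFFFFFF00000000);  /* padding lane */

        /* vector32_and with the broadcast mask drops the padding lane */
        return (eq & UINT64CONST(0x00000000FFFFFFFF)) != 0;
    }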
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Agreed with your points, Nathan. Just a heads-up:
On 14/03/23 03:10, Nathan Bossart wrote:
> On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote:
>> 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from
>> #ifdef USE_NO_SIMD
>> const ListCell *cell;
>> #endif
>> to #else, as mentioned below? This will make the visual separation between the #if cases cleaner
> I would expect to see -Wdeclaration-after-statement warnings if we did
> this.
This worked fine for me, no warnings on gcc 12.2.0. Not a concern though.
Thanks,
Ankit
On Wed, Mar 15, 2023 at 07:31:46PM +0530, Ankit Kumar Pandey wrote:
> On 14/03/23 03:10, Nathan Bossart wrote:
>> On Sat, Mar 11, 2023 at 09:41:18AM +0000, Ankit Kumar Pandey wrote:
>>> 1. In list_member_ptr, will it be okay to bring `const ListCell *cell` from
>>> #ifdef USE_NO_SIMD
>>> const ListCell *cell;
>>> #endif
>>> to #else, as mentioned below? This will make the visual separation between the #if cases cleaner
>> I would expect to see -Wdeclaration-after-statement warnings if we did
>> this.
> This worked fine for me, no warnings on gcc 12.2.0. Not a concern though.
Did you try building without SIMD support? This is what I see:
list.c: In function ‘list_member_ptr’:
list.c:697:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  697 |  const ListCell *cell;
      |  ^~~~~
If your build doesn't have USE_NO_SIMD defined, this warning won't appear
because the code in question will be compiled out.
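For reference, a minimal standalone reproduction of that class of warning
(my own sketch, unrelated to the tree; compile with
-Wdeclaration-after-statement):

    #include <stdio.h>

    int
    example(int x)
    {
        printf("checking %d\n", x); /* an executable statement first... */
        int         y = x + 1;      /* ...then a declaration: warned */

        return y;
    }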
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 15/03/23 21:53, Nathan Bossart wrote:
> Did you try building without SIMD support? This is what I see:
> list.c: In function ‘list_member_ptr’:
> list.c:697:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
>   697 |  const ListCell *cell;
>       |  ^~~~~
> If your build doesn't have USE_NO_SIMD defined, this warning won't appear
> because the code in question will be compiled out.
My mistake: when I retried with USE_NO_SIMD defined, it did show the warning. Sorry for the noise.
Regards,
Ankit
Here is a new patch set. I've split it into two patches: one for the
64-bit functions, and one for the 32-bit functions. I've also added tests
for pg_lfind64/pg_lfind64_idx and deduplicated the code a bit.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v3-0001-speed-up-list_member_ptr-and-list_delete_ptr-with.patch (text/x-diff; charset=us-ascii)
From 7dbbf61e8a2546a73413c3768a49da318f7de7f5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 10 Apr 2023 09:10:47 -0700
Subject: [PATCH v3 1/2] speed up list_member_ptr and list_delete_ptr with SIMD
---
src/backend/nodes/list.c | 35 ++++++
src/include/port/pg_lfind.h | 112 ++++++++++++++++++
src/include/port/simd.h | 96 ++++++++++++++-
.../test_lfind/expected/test_lfind.out | 12 ++
.../modules/test_lfind/sql/test_lfind.sql | 2 +
.../modules/test_lfind/test_lfind--1.0.sql | 8 ++
src/test/modules/test_lfind/test_lfind.c | 62 ++++++++++
7 files changed, 324 insertions(+), 3 deletions(-)
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index a709d23ef1..92bc48de17 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -19,6 +19,7 @@
#include "nodes/pg_list.h"
#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
#include "utils/memdebug.h"
#include "utils/memutils.h"
@@ -680,11 +681,15 @@ list_member(const List *list, const void *datum)
bool
list_member_ptr(const List *list, const void *datum)
{
+#ifdef USE_NO_SIMD
const ListCell *cell;
+#endif
Assert(IsPointerList(list));
check_list_invariants(list);
+#ifdef USE_NO_SIMD
+
foreach(cell, list)
{
if (lfirst(cell) == datum)
@@ -692,6 +697,18 @@ list_member_ptr(const List *list, const void *datum)
}
return false;
+
+#else
+
+ Assert(sizeof(ListCell) == 8);
+ Assert(sizeof(void *) == 8);
+
+ if (list == NIL)
+ return false;
+
+ return pg_lfind64((uint64) datum, (uint64 *) list->elements, list->length);
+
+#endif
}
/*
@@ -875,12 +892,30 @@ list_delete_ptr(List *list, void *datum)
Assert(IsPointerList(list));
check_list_invariants(list);
+#ifdef USE_NO_SIMD
+
foreach(cell, list)
{
if (lfirst(cell) == datum)
return list_delete_cell(list, cell);
}
+#else
+
+ Assert(sizeof(ListCell) == 8);
+ Assert(sizeof(void *) == 8);
+
+ if (list == NIL)
+ return NIL;
+
+ cell = (ListCell *) pg_lfind64_idx((uint64) datum,
+ (uint64 *) list->elements,
+ list->length);
+ if (cell != NULL)
+ return list_delete_cell(list, cell);
+
+#endif
+
/* Didn't find a match: return the list unmodified */
return list;
}
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 59aa8245ed..d08ad91ae8 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -177,4 +177,116 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
return false;
}
+/*
+ * pg_lfind64_internal
+ *
+ * Workhorse for pg_lfind64 and pg_lfind64_idx.
+ */
+static inline bool
+pg_lfind64_internal(uint64 key, uint64 *base, uint32 nelem, uint32 *i)
+{
+#ifdef USE_NO_SIMD
+
+ *i = 0;
+
+#else
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector64 keys = vector64_broadcast(key); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector64) / sizeof(uint64);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+
+ for (*i = 0; *i < tail_idx; *i += nelem_per_iteration)
+ {
+ Vector64 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result;
+
+ /* load the next block into 4 registers */
+ vector64_load(&vals1, &base[*i]);
+ vector64_load(&vals2, &base[*i + nelem_per_vector]);
+ vector64_load(&vals3, &base[*i + nelem_per_vector * 2]);
+ vector64_load(&vals4, &base[*i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector64_eq(keys, vals1);
+ result2 = vector64_eq(keys, vals2);
+ result3 = vector64_eq(keys, vals3);
+ result4 = vector64_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector64_or(result1, result2);
+ tmp2 = vector64_or(result3, result4);
+ result = vector64_or(tmp1, tmp2);
+
+ /* see if there was a match */
+ if (vector64_is_highbit_set(result))
+ return true;
+ }
+#endif /* ! USE_NO_SIMD */
+
+ return false;
+}
+
+/*
+ * pg_lfind64
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind64(uint64 key, uint64 *base, uint32 nelem)
+{
+ uint32 i = 0;
+
+ if (pg_lfind64_internal(key, base, nelem, &i))
+ return true;
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * pg_lfind64_idx
+ *
+ * Return the address of the first element in 'base' that equals 'key', or NULL
+ * if no match is found.
+ */
+static inline uint64 *
+pg_lfind64_idx(uint64 key, uint64 *base, uint32 nelem)
+{
+ uint32 i = 0;
+
+ (void) pg_lfind64_internal(key, base, nelem, &i);
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return &base[i];
+ }
+
+ return NULL;
+}
+
#endif /* PG_LFIND_H */
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..b851d3d89d 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -32,6 +32,7 @@
#define USE_SSE2
typedef __m128i Vector8;
typedef __m128i Vector32;
+typedef __m128i Vector64;
#elif defined(__aarch64__) && defined(__ARM_NEON)
/*
@@ -46,15 +47,16 @@ typedef __m128i Vector32;
#define USE_NEON
typedef uint8x16_t Vector8;
typedef uint32x4_t Vector32;
+typedef uint64x2_t Vector64;
#else
/*
* If no SIMD instructions are available, we can in some cases emulate vector
* operations using bitwise operations on unsigned integers. Note that many
* of the functions in this file presently do not have non-SIMD
- * implementations. In particular, none of the functions involving Vector32
- * are implemented without SIMD since it's likely not worthwhile to represent
- * two 32-bit integers using a uint64.
+ * implementations. For example, none of the functions involving Vector32 are
+ * implemented without SIMD since it's likely not worthwhile to represent two
+ * 32-bit integers using a uint64.
*/
#define USE_NO_SIMD
typedef uint64 Vector8;
@@ -64,12 +66,14 @@ typedef uint64 Vector8;
static inline void vector8_load(Vector8 *v, const uint8 *s);
#ifndef USE_NO_SIMD
static inline void vector32_load(Vector32 *v, const uint32 *s);
+static inline void vector64_load(Vector64 *v, const uint64 *s);
#endif
/* assignment operations */
static inline Vector8 vector8_broadcast(const uint8 c);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_broadcast(const uint32 c);
+static inline Vector64 vector64_broadcast(const uint64 c);
#endif
/* element-wise comparisons to a scalar */
@@ -79,12 +83,15 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline bool vector64_is_highbit_set(const Vector64 v);
#endif
/* arithmetic operations */
static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_or(const Vector64 v1, const Vector64 v2);
+static inline Vector32 vector32_and(const Vector32 v1, const Vector32 v2);
static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#endif
@@ -97,6 +104,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
+static inline Vector64 vector64_eq(const Vector64 v1, const Vector64 v2);
#endif
/*
@@ -126,6 +134,18 @@ vector32_load(Vector32 *v, const uint32 *s)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline void
+vector64_load(Vector64 *v, const uint64 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#elif defined(USE_NEON)
+ *v = vld1q_u64((const uint64_t *) s);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Create a vector with all elements set to the same value.
*/
@@ -153,6 +173,18 @@ vector32_broadcast(const uint32 c)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_broadcast(const uint64 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi64x(c);
+#elif defined(USE_NEON)
+ return vdupq_n_u64(c);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return true if any elements in the vector are equal to the given scalar.
*/
@@ -299,6 +331,23 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Exactly like vector8_is_highbit_set except for the input type, so it
+ * looks at each byte separately. See the comment above
+ * vector32_is_highbit_set for more information.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector64_is_highbit_set(const Vector64 v)
+{
+#if defined(USE_NEON)
+ return vector8_is_highbit_set((Vector8) v);
+#else
+ return vector8_is_highbit_set(v);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -326,6 +375,30 @@ vector32_or(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_or(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ return _mm_or_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vorrq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_and(const Vector32 v1, const Vector32 v2)
+{
+#ifdef USE_SSE2
+ return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+ return vandq_u32(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the result of subtracting the respective elements of the input
* vectors using saturation (i.e., if the operation would yield a value less
@@ -372,4 +445,21 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+#ifndef USE_NO_SIMD
+static inline Vector64
+vector64_eq(const Vector64 v1, const Vector64 v2)
+{
+#ifdef USE_SSE2
+ /* We have to work around SSE2's lack of _mm_cmpeq_epi64. */
+ const Vector32 cmp = vector32_eq(v1, v2);
+ const Vector32 hi = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(3, 3, 1, 1));
+ const Vector32 lo = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(2, 2, 0, 0));
+
+ return vector32_and(hi, lo);
+#elif defined(USE_NEON)
+ return vceqq_u64(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 1d4b14e703..6488c7f700 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -22,3 +22,15 @@ SELECT test_lfind32();
(1 row)
+SELECT test_lfind64();
+ test_lfind64
+--------------
+
+(1 row)
+
+SELECT test_lfind64_idx();
+ test_lfind64_idx
+------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 766c640831..0b5b6eea9e 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -8,3 +8,5 @@ CREATE EXTENSION test_lfind;
SELECT test_lfind8();
SELECT test_lfind8_le();
SELECT test_lfind32();
+SELECT test_lfind64();
+SELECT test_lfind64_idx();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index 81801926ae..55c5fd2087 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
+CREATE FUNCTION test_lfind64()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind64_idx()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
CREATE FUNCTION test_lfind32()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index e2e8b7389f..906e40300e 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -146,3 +146,65 @@ test_lfind32(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+
+PG_FUNCTION_INFO_V1(test_lfind64);
+Datum
+test_lfind64(PG_FUNCTION_ARGS)
+{
+ uint64 test_array[TEST_ARRAY_SIZE] = {0};
+
+ test_array[8] = 1;
+ test_array[64] = 2;
+ test_array[TEST_ARRAY_SIZE - 1] = 3;
+
+ if (pg_lfind64(1, test_array, 4))
+ elog(ERROR, "pg_lfind64() found nonexistent element");
+ if (!pg_lfind64(1, test_array, TEST_ARRAY_SIZE))
+ elog(ERROR, "pg_lfind64() did not find existing element");
+
+ if (pg_lfind64(2, test_array, 32))
+ elog(ERROR, "pg_lfind64() found nonexistent element");
+ if (!pg_lfind64(2, test_array, TEST_ARRAY_SIZE))
+ elog(ERROR, "pg_lfind64() did not find existing element");
+
+ if (pg_lfind64(3, test_array, 96))
+ elog(ERROR, "pg_lfind64() found nonexistent element");
+ if (!pg_lfind64(3, test_array, TEST_ARRAY_SIZE))
+ elog(ERROR, "pg_lfind64() did not find existing element");
+
+ if (pg_lfind64(4, test_array, TEST_ARRAY_SIZE))
+ elog(ERROR, "pg_lfind64() found nonexistent element");
+
+ PG_RETURN_VOID();
+}
+
+PG_FUNCTION_INFO_V1(test_lfind64_idx);
+Datum
+test_lfind64_idx(PG_FUNCTION_ARGS)
+{
+ uint64 test_array[TEST_ARRAY_SIZE] = {0};
+
+ test_array[8] = 1;
+ test_array[64] = 2;
+ test_array[TEST_ARRAY_SIZE - 1] = 3;
+
+ if (pg_lfind64_idx(1, test_array, 4) != NULL)
+ elog(ERROR, "pg_lfind64_idx() found nonexistent element");
+ if (pg_lfind64_idx(1, test_array, TEST_ARRAY_SIZE) != &test_array[8])
+ elog(ERROR, "pg_lfind64_idx() did not find existing element");
+
+ if (pg_lfind64_idx(2, test_array, 32) != NULL)
+ elog(ERROR, "pg_lfind64_idx() found nonexistent element");
+ if (pg_lfind64_idx(2, test_array, TEST_ARRAY_SIZE) != &test_array[64])
+ elog(ERROR, "pg_lfind64_idx() did not find existing element");
+
+ if (pg_lfind64_idx(3, test_array, 96) != NULL)
+ elog(ERROR, "pg_lfind64_idx() found nonexistent element");
+ if (pg_lfind64_idx(3, test_array, TEST_ARRAY_SIZE) != &test_array[TEST_ARRAY_SIZE - 1])
+ elog(ERROR, "pg_lfind64_idx() did not find existing element");
+
+ if (pg_lfind64_idx(4, test_array, TEST_ARRAY_SIZE) != NULL)
+ elog(ERROR, "pg_lfind64_idx() found nonexistent element");
+
+ PG_RETURN_VOID();
+}
--
2.25.1
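
A note on the vector64_eq hunk above: SSE2 has no 64-bit integer compare
(_mm_cmpeq_epi64 only arrived with SSE4.1), so the patch compares 32-bit
lanes and then requires both halves of each 64-bit lane to match. Below is
a minimal standalone sketch of the same trick, outside the Vector32/Vector64
abstraction; the name emulate_cmpeq_epi64 and the demo values are
illustrative, not part of the patch.

/* Build with: cc -msse2 cmpeq64_demo.c */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

static __m128i
emulate_cmpeq_epi64(__m128i a, __m128i b)
{
	/* compare all four 32-bit lanes */
	__m128i		cmp = _mm_cmpeq_epi32(a, b);

	/* replicate the high half of each 64-bit lane into both positions... */
	__m128i		hi = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(3, 3, 1, 1));

	/* ...and likewise the low half */
	__m128i		lo = _mm_shuffle_epi32(cmp, _MM_SHUFFLE(2, 2, 0, 0));

	/* a 64-bit lane is all-ones only if both of its halves matched */
	return _mm_and_si128(hi, lo);
}

int
main(void)
{
	uint64_t	a[2] = {42, UINT64_C(0x100000001)};
	uint64_t	b[2] = {42, UINT64_C(0x000000001)};	/* high halves differ */
	uint64_t	out[2];
	__m128i		r;

	r = emulate_cmpeq_epi64(_mm_loadu_si128((const __m128i *) a),
							_mm_loadu_si128((const __m128i *) b));
	_mm_storeu_si128((__m128i *) out, r);

	/* prints: ffffffffffffffff 0000000000000000 */
	printf("%016llx %016llx\n",
		   (unsigned long long) out[0], (unsigned long long) out[1]);
	return 0;
}
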
v3-0002-speed-up-several-functions-for-lists-with-inline-.patch (text/x-diff; charset=us-ascii)
From ccb3d6c072899063bc6e47388ef9a222a19be324 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Mon, 17 Apr 2023 12:29:10 -0700
Subject: [PATCH v3 2/2] speed up several functions for lists with inline
32-bit data using SIMD
---
src/backend/nodes/list.c | 192 +++++++++++++++++++++++++++++++--------
1 file changed, 155 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 92bc48de17..fa06caf553 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -57,6 +57,9 @@
#define IsOidList(l) ((l) == NIL || IsA((l), OidList))
#define IsXidList(l) ((l) == NIL || IsA((l), XidList))
+static inline ListCell *list_member_inline_internal_idx(const List *list, uint32 datum);
+static inline bool list_member_inline_internal(const List *list, uint32 datum);
+
#ifdef USE_ASSERT_CHECKING
/*
* Check that the specified List is valid (so far as we can tell).
@@ -717,18 +720,10 @@ list_member_ptr(const List *list, const void *datum)
bool
list_member_int(const List *list, int datum)
{
- const ListCell *cell;
-
Assert(IsIntegerList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_int(cell) == datum)
- return true;
- }
-
- return false;
+ return list_member_inline_internal(list, datum);
}
/*
@@ -737,18 +732,10 @@ list_member_int(const List *list, int datum)
bool
list_member_oid(const List *list, Oid datum)
{
- const ListCell *cell;
-
Assert(IsOidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_oid(cell) == datum)
- return true;
- }
-
- return false;
+ return list_member_inline_internal(list, datum);
}
/*
@@ -757,18 +744,10 @@ list_member_oid(const List *list, Oid datum)
bool
list_member_xid(const List *list, TransactionId datum)
{
- const ListCell *cell;
-
Assert(IsXidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_xid(cell) == datum)
- return true;
- }
-
- return false;
+ return list_member_inline_internal(list, datum);
}
/*
@@ -929,11 +908,9 @@ list_delete_int(List *list, int datum)
Assert(IsIntegerList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_int(cell) == datum)
- return list_delete_cell(list, cell);
- }
+ cell = list_member_inline_internal_idx(list, datum);
+ if (cell != NULL)
+ return list_delete_cell(list, cell);
/* Didn't find a match: return the list unmodified */
return list;
@@ -948,11 +925,9 @@ list_delete_oid(List *list, Oid datum)
Assert(IsOidList(list));
check_list_invariants(list);
- foreach(cell, list)
- {
- if (lfirst_oid(cell) == datum)
- return list_delete_cell(list, cell);
- }
+ cell = list_member_inline_internal_idx(list, datum);
+ if (cell != NULL)
+ return list_delete_cell(list, cell);
/* Didn't find a match: return the list unmodified */
return list;
@@ -1749,3 +1724,146 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
return 1;
return 0;
}
+
+/*
+ * list_member_inline_helper
+ *
+ * Workhorse for list_member_inline_internal and
+ * list_member_inline_internal_idx.
+ */
+static inline bool
+list_member_inline_helper(const List *list, uint32 datum, uint32 *i)
+{
+#ifdef USE_NO_SIMD
+
+ *i = 0;
+
+#else
+
+ /*
+ * For better instruction-level parallelism, each loop iteration operates
+ * on a block of four registers.
+ */
+ const Vector32 keys = vector32_broadcast(datum); /* load copies of key */
+ const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+#ifdef USE_NEON
+ const Vector32 mask = (Vector32) vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+#else
+ const Vector32 mask = vector64_broadcast(UINT64CONST(0xFFFFFFFF));
+#endif
+ const uint32 *elements = (const uint32 *) list->elements;
+
+ /* round down to multiple of elements per iteration */
+ const uint32 tail_idx = (list->length * 2) & ~(nelem_per_iteration - 1);
+
+ /*
+ * The SIMD optimized portion of this routine is written with the
+ * expectation that the 32-bit datum we are searching for only takes up
+ * half of a ListCell. If that changes, this routine must change, too.
+ */
+ Assert(sizeof(ListCell) == 8);
+
+ for (*i = 0; *i < tail_idx; *i += nelem_per_iteration)
+ {
+ Vector32 vals1,
+ vals2,
+ vals3,
+ vals4,
+ result1,
+ result2,
+ result3,
+ result4,
+ tmp1,
+ tmp2,
+ result,
+ masked;
+
+ /* load the next block into 4 registers */
+ vector32_load(&vals1, &elements[*i]);
+ vector32_load(&vals2, &elements[*i + nelem_per_vector]);
+ vector32_load(&vals3, &elements[*i + nelem_per_vector * 2]);
+ vector32_load(&vals4, &elements[*i + nelem_per_vector * 3]);
+
+ /* compare each value to the key */
+ result1 = vector32_eq(keys, vals1);
+ result2 = vector32_eq(keys, vals2);
+ result3 = vector32_eq(keys, vals3);
+ result4 = vector32_eq(keys, vals4);
+
+ /* combine the results into a single variable */
+ tmp1 = vector32_or(result1, result2);
+ tmp2 = vector32_or(result3, result4);
+ result = vector32_or(tmp1, tmp2);
+
+ /* filter out matches in space between data */
+ masked = vector32_and(result, mask);
+
+ /* break out and find the exact element if there was a match */
+ if (vector32_is_highbit_set(masked))
+ {
+ *i /= 2;
+ return true;
+ }
+ }
+
+#endif /* ! USE_NO_SIMD */
+
+ *i /= 2;
+ return false;
+}
+
+/*
+ * list_member_inline_internal
+ *
+ * Optimized linear search routine (using SIMD intrinsics where available) for
+ * lists with inline 32-bit data.
+ */
+static inline bool
+list_member_inline_internal(const List *list, uint32 datum)
+{
+ uint32 i = 0;
+ const ListCell *cell;
+
+ if (list == NIL)
+ return false;
+
+ if (list_member_inline_helper(list, datum, &i))
+ return true;
+
+ /* Process the remaining elements one at a time. */
+ for_each_from(cell, list, i)
+ {
+ if (lfirst_int(cell) == (int) datum)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * list_member_inline_internal_idx
+ *
+ * Optimized linear search routine (using SIMD intrinsics where available) for
+ * lists with inline 32-bit data.
+ */
+static inline ListCell *
+list_member_inline_internal_idx(const List *list, uint32 datum)
+{
+ uint32 i = 0;
+ ListCell *cell;
+
+ if (list == NIL)
+ return NULL;
+
+ (void) list_member_inline_helper(list, datum, &i);
+
+ /* Process the remaining elements one at a time. */
+ for_each_from(cell, list, i)
+ {
+ if (lfirst_int(cell) == (int) datum)
+ return cell;
+ }
+
+ return NULL;
+}
--
2.25.1
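
A note on the mask used in list_member_inline_helper above, since it is the
subtle part: int, Oid, and Xid lists store a 32-bit datum in each 8-byte
ListCell, so when the cell array is reinterpreted as uint32 lanes, every odd
lane is unspecified padding that could match the key by accident. ANDing the
comparison results with a 64-bit broadcast of 0xFFFFFFFF keeps only the
even, data-carrying lanes (assuming little-endian layout, as the SIMD path
does). Here is a scalar sketch of the same filtering, with an illustrative
function name that is not in the patch:

#include <stdbool.h>
#include <stdint.h>

static bool
member_scalar(const uint64_t *cells, uint32_t nelems, uint32_t datum)
{
	/* view each 8-byte cell as two 32-bit lanes */
	const uint32_t *lanes = (const uint32_t *) cells;

	for (uint32_t i = 0; i < nelems * 2; i++)
	{
		/* even lanes hold the datum; ignore matches in odd (padding) lanes */
		if (lanes[i] == datum && (i % 2) == 0)
			return true;
	}
	return false;
}
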
On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> was mostly a fun weekend project, and I don't presently have any concrete
> examples of workloads where this might help.

It seems like that should be demonstrated before seriously considering
this, like a profile where the relevant list functions show up.
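For instance (a hypothetical workflow, not a measurement anyone has done
here): attach "perf record -g" to a backend running a representative
workload and check whether list_member_int() and friends account for a
meaningful fraction of samples in "perf report".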
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Apr 21, 2023 at 01:50:34PM +0700, John Naylor wrote:
> On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> was mostly a fun weekend project, and I don't presently have any concrete
>> examples of workloads where this might help.
>
> It seems like that should be demonstrated before seriously considering
> this, like a profile where the relevant list functions show up.

Agreed.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On 21/04/2023 23:33, Nathan Bossart wrote:
> On Fri, Apr 21, 2023 at 01:50:34PM +0700, John Naylor wrote:
>> On Wed, Mar 8, 2023 at 7:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>> was mostly a fun weekend project, and I don't presently have any concrete
>>> examples of workloads where this might help.
>>
>> It seems like that should be demonstrated before seriously considering
>> this, like a profile where the relevant list functions show up.
>
> Agreed.

Grepping for "tlist_member" and "list_delete_ptr", I don't see any
callers in hot codepaths where this could make a noticeable difference.
So I've marked this as Returned with Feedback in the commitfest.

> I noticed that several of the List functions do simple linear searches that
> can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in
> 37a6e5d). The following table shows the time spent iterating over a list
> of n elements (via list_member_int) one billion times on my x86 laptop.
>
>     n | head (ms) | patched (ms)
> ------+-----------+--------------
>     2 |      3884 |         3001
>     4 |      5506 |         4092
>     8 |      6209 |         3026
>    16 |      8797 |         4458
>    32 |     25051 |         7032
>    64 |     37611 |        12763
>   128 |     61886 |        22770
>   256 |    111170 |        59885
>   512 |    209612 |       103378
>  1024 |    407462 |       189484

I'm surprised to see an improvement with n=2 and n=4. AFAICS, the
vectorization only kicks in when n >= 8.
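
To spell out the arithmetic from list_member_inline_helper: with 16-byte
vectors there are 4 uint32 lanes per vector and 4 vectors per iteration, so
one iteration covers 16 lanes, i.e. 8 ListCells, and the tail rounding
leaves shorter lists entirely to the scalar fallback. Plugging in the
constants:

	nelem_per_vector    = sizeof(Vector32) / sizeof(uint32);	/* 4 */
	nelem_per_iteration = 4 * nelem_per_vector;			/* 16 lanes = 8 cells */
	tail_idx            = (n * 2) & ~(nelem_per_iteration - 1);	/* 0 for any n < 8 */

So for n=2 and n=4 the patched code still compares elements one at a time,
and any improvement there must come from something other than the SIMD
loop.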
--
Heikki Linnakangas
Neon (https://neon.tech)