[PATCH] SVE popcount support

Started by Malladi, Ramaover 1 year ago35 messageshackers

rvmallad@amazon.com

over 1 year ago

Attachments protected by Amazon:

[0001-SVE-popcount-support.patch]
https://us-west-2.secure-attach.amazon.com/a29c9ff9-1f9b-430f-9b3c-07fde9a419aa/f9178627-0600-4527-bc5c-7e4cb9ef6e9a
[SVE-popcount-support-PostgreSQL.png]
https://us-west-2.secure-attach.amazon.com/a29c9ff9-1f9b-430f-9b3c-07fde9a419aa/13c252c4-c45e-447c-9e55-fe637f8d345c

Amazon has replaced the attachments in this email with download links. Downloads will be available until December 27, 2024, 15:43 (UTC+00:00).
[Tell us what you think] https://amazonexteu.qualtrics.com/jfe/form/SV_ehuz6zGo8YnsRKK
[For more information click here] https://docs.secure-attach.amazon.com/guide

Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0]/messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.

[0]: /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Import Notes

Reference msg id not found: D3CC6BC8-1735-4939-BB77-E5B4EC66528E@amazon.com

Kirill Reshke

reshkekirill@gmail.com

over 1 year ago

In reply to: Malladi, Rama (#1)

Re: [PATCH] SVE popcount support

On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:

Attachments protected by Amazon: 0001-SVE-popcount-support.patch | SVE-popcount-support-PostgreSQL.png |
Amazon has replaced the attachments in this email with download links. Downloads will be available until December 27, 2024, 15:43 (UTC+00:00). Tell us what you think
For more information click here

Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.

[0] /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Hi! To register entry on commitfest you need to send patch in one of
this format:

https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F

This is useful for reviewers who use cfbot or cputube.

--
Best regards,
Kirill Reshke

Kirill Reshke

reshkekirill@gmail.com

over 1 year ago

In reply to: Malladi, Rama (#1)

Re: [PATCH] SVE popcount support

On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:

Attachments protected by Amazon: 0001-SVE-popcount-support.patch |

SVE-popcount-support-PostgreSQL.png |

Amazon has replaced the attachments in this email with download links.

Downloads will be available until December 27, 2024, 15:43 (UTC+00:00).
Tell us what you think

For more information click here

Please find attached a patch to PostgreSQL implementing SVE popcount. I

used John Naylor's test_popcount module [0]https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/intarray/_int_bool.c;h=2b2c3f4029ec5cb887bdc6b01439b15271483bbf;hb=HEAD#l179 to put together the attached
graphs. This test didn't show any regressions with a relatively small
number of bytes, and it showed the expected improvements with many bytes.

[0]

/messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Hi!
I did look inside this patch. This was implemented mostly in the same way
as the current instructure selecting code, which is good.

=== patch itself

// for small buffer sizes (<= 128-bytes), execute 1-byte SVE instructions
// for larger buffer sizes (> 128-bytes), execute 1-byte + 8-byte SVE

instructions

// loop unroll by 2

PostgreSQL uses /* */ comment style.

+ if (bytes <= 128)
+ {
+ prologue_loop_bytes = bytes;
+ }
+ else
+ {
+ aligned_buf   = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) +

sizeof(uint64_t);

+ prologue_loop_bytes   = aligned_buf - buf;
+ }

For a single line stmt PostgreSQL does not use parenthesis. Examples [0]https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/intarray/_int_bool.c;h=2b2c3f4029ec5cb887bdc6b01439b15271483bbf;hb=HEAD#l179 &
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpgsql/src/pl_handler.c;h=b18a3d0b97b111e55591df787143d015e7f1fdc5;hb=HEAD#l68

[0]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/intarray/_int_bool.c;h=2b2c3f4029ec5cb887bdc6b01439b15271483bbf;hb=HEAD#l179
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/intarray/_int_bool.c;h=2b2c3f4029ec5cb887bdc6b01439b15271483bbf;hb=HEAD#l179
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpgsql/src/pl_handler.c;h=b18a3d0b97b111e55591df787143d015e7f1fdc5;hb=HEAD#l68
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpgsql/src/pl_handler.c;h=b18a3d0b97b111e55591df787143d015e7f1fdc5;hb=HEAD#l68

3) `if (bytes > 128)` Loop in pg_popcount_sve function should be commented.
There is too much code without any comment why it works. For example, If
original source of this is some paper or other work, we can reference it.

==== by-hand benching (I also use John Naylor's module)

non-patched

```
db1=# \timing
Timing is on.
db1=# select drive_popcount(10000000, 10000);
drive_popcount
----------------
64608
(1 row)

Time: 8886.493 ms (00:08.886) -- with small variance (+- 100ms)

db1=# select drive_popcount64(10000000, 10000);
drive_popcount64
------------------
64608
(1 row)

Time: 139501.555 ms (02:19.502) with small variance (+- 1-2sec)
```

patched

```
db1=# select drive_popcount(10000000, 10000);
drive_popcount
----------------
64608
(1 row)

Time: 8803.855 ms (00:08.804) -- with small variance
db1=# select drive_popcount64(10000000, 10000);
drive_popcount64
------------------
64608
(1 row)

Time: 200716.879 ms (02:21.717) -- with small variance
```

I'm not sure how to interpret these results. Looks like this does not help
much on a large $num?

--
Best regards,
Kirill Reshke

Bruce Momjian

bruce@momjian.us

over 1 year ago

In reply to: Malladi, Rama (#1)

Re: [PATCH] SVE popcount support

On Wed, Nov 27, 2024 at 03:43:27PM +0000, Malladi, Rama wrote:

• Attachments protected by Amazon:
• 0001-SVE-popcount-support.patch |
• SVE-popcount-support-PostgreSQL.png |

Amazon has replaced the attachments in this email with download links.
Downloads will be available until December 27, 2024, 15:43 (UTC+00:00). Tell us
what you think
For more information click here

Please find attached a patch to PostgreSQL implementing SVE popcount. I used
John Naylor's test_popcount module [0] to put together the attached graphs.
This test didn't show any regressions with a relatively small number of bytes,
and it showed the expected improvements with many bytes.

You must attach actual attachments for this to be considered. Download
links are unacceptable.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

Malladi, Rama

ramamalladi@hotmail.com

over 1 year ago

In reply to: Kirill Reshke (#3)

Re: [PATCH] SVE popcount support

Thank you, Kirill, for the review and the feedback. Please find inline
my reply and an updated patch.

On 11/29/24 2:37 AM, Kirill Reshke wrote:

PostgreSQL uses /* */ comment style.

Fixed in the attached patch.

2)
For a single line stmt PostgreSQL does not use parenthesis. Examples
[0] & [1]

Fixed in the attached patch.

3) `if (bytes > 128)` Loop in pg_popcount_sve function should be
commented. There is too much code without any comment why it works.
For example, If original source of this is some paper or other work,
we can reference it.

From experimentation we found that for smaller buffer sizes, the
overhead of computing prologue, kernel and epilogue loop parameters is
high. So, for|< 128B|buffer size case, we use the SVE|8-bit|loop and for
larger buffer sizes (|>= 128|), we use the|64-bit|SVE implementation.
Attached is an SVE popcount implementation comparison.

==== by-hand benching (I also use John Naylor's module)

non-patched
```
db1=# select drive_popcount(10000000, 10000);
Time: 8886.493 ms (00:08.886) -- with small variance (+- 100ms)
db1=# select drive_popcount64(10000000, 10000);
Time: 139501.555 ms (02:19.502) with small variance (+- 1-2sec)
```

patched
```
db1=# select drive_popcount(10000000, 10000);
Time: 8803.855 ms (00:08.804) -- with small variance
db1=# select drive_popcount64(10000000, 10000);
Time: 200716.879 ms (02:21.717) -- with small variance
```

I'm not sure how to interpret these results. Looks like this does not
help much on a large $num?

Can you clarify on what system, architecture did you test this patch?
Note, the patch has optimizations only for `popcount` and not `popcount64`.

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Malladi, Rama (#5)

Re: [PATCH] SVE popcount support

On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:

Thank you, Kirill, for the review and the feedback. Please find inline my
reply and an updated patch.

Thanks for the updated patch. I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27). The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code. IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version of
Postgres, which will take some time for significant adoption, and 2) this
is brand new code, so we aren't introducing any regressions for current
users.

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

#if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
#endif

#ifdef TRY_POPCNT_FAST
...

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?

+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (check_sve_support())
+		pg_popcount_optimized = pg_popcount_sve;
+	else
+		pg_popcount_optimized = pg_popcount_slow;
+	return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?

+// check if sve supported
+int check_sve_support(void)
+{
+	// Read ID_AA64PFR0_EL1 register
+	uint64_t pfr0;
+	__asm__ __volatile__(
+	"mrs %0, ID_AA64PFR0_EL1"
+	: "=r" (pfr0));
+
+	// SVE bits are 32-35
+	return (pfr0 >> 32) & 0xf;
+}

Is this based on some reference code from a manual that we could cite here?
Or better yet, is it possible to do this without inline assembly (e.g.,
with another intrinsic function)?

+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan

Devanga.Susmitha@fujitsu.com

over 1 year ago

In reply to: Nathan Bossart (#6)

Re: [PATCH] SVE popcount support

Hello,

We are sharing our patch for pg_popcount with SVE support as a contribution from our side in this thread. We hope this contribution will help in exploring and refining the popcount implementation further.
Our patch uses the existing infrastructure, i.e. the "choose_popcount_functions" method, to determine the correct popcount implementation based on the architecture, thereby requiring fewer code changes. The patch also includes implementations for popcount and popcount masked.
We can reference both solutions and work together toward achieving the most efficient and effective implementation for PostgreSQL.

Algorithm Overview:
1. For larger inputs, align the buffers to avoid double loads. For smaller inputs alignment is not necessary and might even degrade the performance.
2. Process the aligned buffer chunk by chunk till the last incomplete chunk.
3. Process the last incomplete chunk.
Our setup:
Machine: AWS EC2 c7g.8xlarge - 32vcpu, 64gb RAM
OS : Ubuntu 22.04.5 LTS
GCC: 11.4

Benchmark and Result:
We have used PostgreSQL community recommended popcount-test-module[0] for benchmarking and observed a speed-up of more than 4x for larger buffers. Even for smaller inputs of size 8 and 16 bytes there aren't any performance degradations observed.
Looking forward to your thoughts!

Thanks & Regards,
Susmitha Devanga.

________________________________
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Wednesday, December 4, 2024 21:37
To: Malladi, Rama <ramamalladi@hotmail.com>
Cc: Kirill Reshke <reshkekirill@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: [PATCH] SVE popcount support

On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:

Thank you, Kirill, for the review and the feedback. Please find inline my
reply and an updated patch.

+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

#if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
#endif

#ifdef TRY_POPCNT_FAST
...

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?

+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+     if (check_sve_support())
+             pg_popcount_optimized = pg_popcount_sve;
+     else
+             pg_popcount_optimized = pg_popcount_slow;
+     return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?

+// check if sve supported
+int check_sve_support(void)
+{
+     // Read ID_AA64PFR0_EL1 register
+     uint64_t pfr0;
+     __asm__ __volatile__(
+     "mrs %0, ID_AA64PFR0_EL1"
+     : "=r" (pfr0));
+
+     // SVE bits are 32-35
+     return (pfr0 >> 32) & 0xf;
+}

Is this based on some reference code from a manual that we could cite here?
Or better yet, is it possible to do this without inline assembly (e.g.,
with another intrinsic function)?

+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan

Malladi, Rama

ramamalladi@hotmail.com

over 1 year ago

In reply to: Devanga.Susmitha@fujitsu.com (#7)

Re: [PATCH] SVE popcount support

On 12/9/24 12:21 AM, Devanga.Susmitha@fujitsu.com wrote:

Hello,

We are sharing our patch for pg_popcount with SVE support as a
contribution from our side in this thread. We hope this contribution
will help in exploring and refining the popcount implementation further.
Our patch uses the existing infrastructure, i.e. the
"choose_popcount_functions" method, to determine the correct popcount
implementation based on the architecture, thereby requiring fewer code
changes. The patch also includes implementations for popcount and
popcount masked.
We can reference both solutions and work together toward achieving the
most efficient and effective implementation for PostgreSQL.

Thanks for the patch and it looks good. I will review the full patch in
the next couple of days. One observation was that the patch has `xsave`
flags added. This isn't needed.

`pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt +
cflags_popcnt_arm, 'xsave': cflags_xsave}`

*Algorithm Overview:*
1. For larger inputs, align the buffers to avoid double loads. For
smaller inputs alignment is not necessary and might even degrade the
performance.
2. Process the aligned buffer chunk by chunk till the last incomplete
chunk.
3. Process the last incomplete chunk.
*Our setup:*
Machine: AWS EC2 c7g.8xlarge - 32vcpu, 64gb RAM
OS : Ubuntu 22.04.5 LTS
GCC: 11.4
*Benchmark and Result:*
We have used PostgreSQL community recommended
popcount-test-module[0] for benchmarking and observed a speed-up of
more than 4x for larger buffers. Even for smaller inputs of size 8 and
16 bytes there aren't any performance degradations observed.
Looking forward to your thoughts!

I tested the patch and here attached is the performance I see on a
`c7g.xlarge`. The perf data doesn't quite match to what you observe
(especially for 256B). In the chart, I have comparison of baseline, AWS
SVE (what I had implemented) and Fujitsu SVE popcount implementations.
Can you confirm the command-line you had used for the benchmark run?

I had used the below command-line:

`sudo su postgres -c "/usr/local/pgsql/bin/psql -c 'EXPLAIN ANALYZE
SELECT drive_popcount(100000, 16);'"`

Show quoted text

------------------------------------------------------------------------
*From:* Nathan Bossart <nathandbossart@gmail.com>
*Sent:* Wednesday, December 4, 2024 21:37
*To:* Malladi, Rama <ramamalladi@hotmail.com>
*Cc:* Kirill Reshke <reshkekirill@gmail.com>; pgsql-hackers
<pgsql-hackers@postgresql.org>
*Subject:* Re: [PATCH] SVE popcount support
On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:

Thank you, Kirill, for the review and the feedback. Please find

inline my

reply and an updated patch.

Thanks for the updated patch. I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/
<https://commitfest.postgresql.org/>) so that it is picked up by our
automated patch testing tools?
+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to 
use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27). The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code. IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major
version of
Postgres, which will take some time for significant adoption, and 2) this
is brand new code, so we aren't introducing any regressions for current
users.
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char 
*buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

        #if defined(TRY_POPCNT_FAST) ||
defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
        extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
    #endif

    #ifdef TRY_POPCNT_FAST
    ...
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif
Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?
+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+     if (check_sve_support())
+             pg_popcount_optimized = pg_popcount_sve;
+     else
+             pg_popcount_optimized = pg_popcount_slow;
+     return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?
+// check if sve supported
+int check_sve_support(void)
+{
+     // Read ID_AA64PFR0_EL1 register
+     uint64_t pfr0;
+     __asm__ __volatile__(
+     "mrs %0, ID_AA64PFR0_EL1"
+     : "=r" (pfr0));
+
+     // SVE bits are 32-35
+     return (pfr0 >> 32) & 0xf;
+}
Is this based on some reference code from a manual that we could cite
here?
Or better yet, is it possible to do this without inline assembly (e.g.,
with another intrinsic function)?
+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)
I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan

Chiranmoy.Bhattacharya@fujitsu.com

over 1 year ago

In reply to: Malladi, Rama (#8)

Re: [PATCH] SVE popcount support

Thank you for the suggestion; we have removed the `xsave` flag.

We have used the following command for benchmarking:
time ./build_fj/bin/psql pop_db -c "select drive_popcount(10000000, 16);"

We ran it 20 times and took the average to flatten any CPU fluctuations. The results observed on `m7g.4xlarge`are in the attached Excel file.

We have also updated the condition for buffer alignment, skipping the alignment process if the buffer is already aligned. This seems to have improved the performance by a few milliseconds because the input buffer provided by `drive_popcount` is already aligned. PFA for the updated patch file.

Thanks,
Chiranmoy

________________________________
From: Malladi, Rama <ramamalladi@hotmail.com>
Sent: Monday, December 9, 2024 10:06 PM
To: Susmitha, Devanga <Devanga.Susmitha@fujitsu.com>; Nathan Bossart <nathandbossart@gmail.com>
Cc: Kirill Reshke <reshkekirill@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>; Bhattacharya, Chiranmoy <Chiranmoy.Bhattacharya@fujitsu.com>; M A, Rajat <Rajat.Ma@fujitsu.com>; Hajela, Ragesh <Ragesh.Hajela@fujitsu.com>
Subject: Re: [PATCH] SVE popcount support

On 12/9/24 12:21 AM, Devanga.Susmitha@fujitsu.com<mailto:Devanga.Susmitha@fujitsu.com> wrote:
Hello,

Thanks for the patch and it looks good. I will review the full patch in the next couple of days. One observation was that the patch has `xsave` flags added. This isn't needed.

`pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt + cflags_popcnt_arm, 'xsave': cflags_xsave}`

I tested the patch and here attached is the performance I see on a `c7g.xlarge`. The perf data doesn't quite match to what you observe (especially for 256B). In the chart, I have comparison of baseline, AWS SVE (what I had implemented) and Fujitsu SVE popcount implementations. Can you confirm the command-line you had used for the benchmark run?

I had used the below command-line:

`sudo su postgres -c "/usr/local/pgsql/bin/psql -c 'EXPLAIN ANALYZE SELECT drive_popcount(100000, 16);'"`

________________________________
From: Nathan Bossart <nathandbossart@gmail.com><mailto:nathandbossart@gmail.com>
Sent: Wednesday, December 4, 2024 21:37
To: Malladi, Rama <ramamalladi@hotmail.com><mailto:ramamalladi@hotmail.com>
Cc: Kirill Reshke <reshkekirill@gmail.com><mailto:reshkekirill@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org><mailto:pgsql-hackers@postgresql.org>
Subject: Re: [PATCH] SVE popcount support

On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:

Thank you, Kirill, for the review and the feedback. Please find inline my
reply and an updated patch.

+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

#if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
#endif

#ifdef TRY_POPCNT_FAST
...

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?

+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+     if (check_sve_support())
+             pg_popcount_optimized = pg_popcount_sve;
+     else
+             pg_popcount_optimized = pg_popcount_slow;
+     return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?

+// check if sve supported
+int check_sve_support(void)
+{
+     // Read ID_AA64PFR0_EL1 register
+     uint64_t pfr0;
+     __asm__ __volatile__(
+     "mrs %0, ID_AA64PFR0_EL1"
+     : "=r" (pfr0));
+
+     // SVE bits are 32-35
+     return (pfr0 >> 32) & 0xf;
+}

Is this based on some reference code from a manual that we could cite here?
Or better yet, is it possible to do this without inline assembly (e.g.,
with another intrinsic function)?

+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan

#10

Chiranmoy.Bhattacharya@fujitsu.com

over 1 year ago

In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#9)

Re: [PATCH] SVE popcount support

Hi all,

Here is the updated patch using pg_attribute_target("arch=armv8-a+sve") to compile the arch-specific function instead of using compiler flags.

---
Chiranmoy

#11

Malladi, Rama

ramamalladi@hotmail.com

over 1 year ago

In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#10)

Re: [PATCH] SVE popcount support

Here is the updated patch using
pg_attribute_target("arch=armv8-a+sve") to compile the arch-specific
function instead of using compiler flags.

---

This looks good. Thanks Chiranmoy and team. Can you address any other
feedback from Nathan or others here? Then we can pursue further reviews
and merging of the patch.

Show quoted text

#12

Chiranmoy.Bhattacharya@fujitsu.com

over 1 year ago

In reply to: Malladi, Rama (#11)

Re: [PATCH] SVE popcount support

This looks good. Thanks Chiranmoy and team. Can you address any other feedback from Nathan or others here? Then we can pursue further reviews and merging of the patch.

Thank you for the review.
If there is no further feedback from the community, may we submit the patch for the next commit fest?

#13

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#12)

Re: [PATCH] SVE popcount support

On Wed, Jan 22, 2025 at 11:04:22AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

If there is no further feedback from the community, may we submit the
patch for the next commit fest?

I would encourage you to create a commitfest entry so that it is picked up
by our automated patch testing tools.

https://commitfest.postgresql.org/

--
nathan

#14

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Nathan Bossart (#13)

Re: [PATCH] SVE popcount support

The meson configure check seems to fail on my machine:

error: too many arguments to function call, expected 0, have 1
10 | svuint64_t popcnt = svcntb(val);
| ~~~~~~ ^~~

error: returning '__SVInt64_t' from a function with incompatible result type 'int'
12 | return popcnt == 0;
| ^~~~~~~~~~~

The autoconf version seems to work okay, though.

+ pgac_save_CFLAGS=$CFLAGS
+ CFLAGS="$pgac_save_CFLAGS $1"

I don't see any extra compiler flag tests used, so we no longer need this,
right?

+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi

I'm curious why this doesn't use Ac_cachevar like the examples above it
(e.g., PGAC_XSAVE_INTRINSICS).

+  prog = '''
+#include <arm_sve.h>
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    const svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
+    svuint64_t popcnt = svcntb(val);
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+'''

This test looks quite different than the autoconf one. Why is that? I
would expect them to be the same. And I think ideally the test would check
that all the intrinsics functions we need are available.

+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+    return getauxval(AT_HWCAP) & HWCAP_SVE;
+}

pg_crc32c_armv8_available() (in pg_crc32c_armv8_choose.c) looks quite a bit
more complicated than this. Are we missing something here?

+    /*
+     * For smaller inputs, aligning the buffer degrades the performance.
+     * Therefore, the buffers only when the input size is sufficiently large.
+     */

Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs? I'm also curious what level of performance degradation you
were seeing.

+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK

nitpick: The USE_SVE_POPCNT_WITH_RUNTIME_CHECK check can probably go above
the #include for pg_bitutils.h (but below the one for c.h).

--
nathan

#15

Chiranmoy.Bhattacharya@fujitsu.com

over 1 year ago

In reply to: Nathan Bossart (#14)

Re: [PATCH] SVE popcount support

The meson configure check seems to fail on my machine
This test looks quite different than the autoconf one. Why is that? I

would expect them to be the same. And I think ideally the test would check
that all the intrinsics functions we need are available.

Fixed, both meson and autoconf have the same test program with all the intrinsics.
Meson should work now.

+ pgac_save_CFLAGS=$CFLAGS
+ CFLAGS="$pgac_save_CFLAGS $1"

I don't see any extra compiler flag tests used, so we no longer need this,

right?

True, removed it.

+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi

I'm curious why this doesn't use Ac_cachevar like the examples above it

(e.g., PGAC_XSAVE_INTRINSICS).

Implemented using Ac_cachevar similar to PGAC_XSAVE_INTRINSICS.

+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK

nitpick: The USE_SVE_POPCNT_WITH_RUNTIME_CHECK check can probably go above

the #include for pg_bitutils.h (but below the one for c.h).

Done.

pg_crc32c_armv8_available() (in pg_crc32c_armv8_choose.c) looks quite a bit

more complicated than this. Are we missing something here?

SVE is only available in aarch64, so we don't need to worry about aarch32. The latest patch
includes runtime checks for Linux and FreeBSD. For all other operating systems, false is
returned, because we are unable to verify the check.

+    /*
+     * For smaller inputs, aligning the buffer degrades the performance.
+     * Therefore, the buffers only when the input size is sufficiently large.
+     */

Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs? I'm also curious what level of performance degradation you
were seeing.

Here is a comparison of all three cases. Alignment is marginally better for inputs
above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
Always Aligned => condition "bytes > 4 * vec_len" is removed.
Unaligned => the whole if block was removed

buf | Always Aligned | Aligned After 128B | Unaligned
--------+---------------+--------------------+------------
16 | 37.851 | 38.203 | 34.971
32 | 37.859 | 38.187 | 34.972
64 | 37.611 | 37.405 | 34.121
128 | 45.357 | 45.897 | 41.890
256 | 62.440 | 63.454 | 58.666
512 | 100.120 | 102.767 | 99.861
1024 | 159.574 | 158.594 | 164.975
2048 | 282.354 | 281.198 | 283.937
4096 | 532.038 | 531.068 | 533.699
8192 | 1038.973 | 1038.083 | 1039.206
16384 | 2028.604 | 2025.843 | 2033.940

---
Chiranmoy

#16

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#15)

Re: [PATCH] SVE popcount support

On Tue, Feb 04, 2025 at 09:01:33AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

+    /*
+     * For smaller inputs, aligning the buffer degrades the performance.
+     * Therefore, the buffers only when the input size is sufficiently large.
+     */
Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs? I'm also curious what level of performance degradation you
were seeing.

Here is a comparison of all three cases. Alignment is marginally better for inputs
above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
Always Aligned => condition "bytes > 4 * vec_len" is removed.
Unaligned => the whole if block was removed

buf | Always Aligned | Aligned After 128B | Unaligned
--------+---------------+--------------------+------------
16 | 37.851 | 38.203 | 34.971
32 | 37.859 | 38.187 | 34.972
64 | 37.611 | 37.405 | 34.121
128 | 45.357 | 45.897 | 41.890
256 | 62.440 | 63.454 | 58.666
512 | 100.120 | 102.767 | 99.861
1024 | 159.574 | 158.594 | 164.975
2048 | 282.354 | 281.198 | 283.937
4096 | 532.038 | 531.068 | 533.699
8192 | 1038.973 | 1038.083 | 1039.206
16384 | 2028.604 | 2025.843 | 2033.940

Hm. These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment. Is there any reason not
to do that?

+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

--
nathan

#17

Chiranmoy.Bhattacharya@fujitsu.com

over 1 year ago

In reply to: Nathan Bossart (#16)

Re: [PATCH] SVE popcount support

Hm. These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment. Is there any reason not
to do that?

It seems that the double load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove it to simplify the code.

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

The unrolled version performs better than the non-unrolled one, but
processing four vectors provides no additional benefit. The numbers
and code used are given below.

buf | Not Unrolled | Unrolled x2 | Unrolled x4
------+-------------+-------------+-------------
16 | 4.774 | 4.759 | 5.634
32 | 6.872 | 6.486 | 7.348
64 | 11.070 | 10.249 | 10.617
128 | 20.003 | 16.205 | 16.764
256 | 40.234 | 28.377 | 29.108
512 | 83.825 | 53.420 | 53.658
1024 | 191.181 | 101.677 | 102.727
2048 | 389.160 | 200.291 | 201.544
4096 | 785.742 | 404.593 | 399.134
8192 | 1587.226 | 811.314 | 810.961

/* Process 4 vectors */
for (; i < loop_bytes; i += vec_len * 4)
{
vec64_1 = svld1(pred, (const uint64 *) (buf + i));
accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));

vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}

-Chiranmoy

#18

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#17)

Re: [PATCH] SVE popcount support

On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

The unrolled version performs better than the non-unrolled one, but
processing four vectors provides no additional benefit. The numbers
and code used are given below.

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

--
nathan

#19

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Nathan Bossart (#18)

Re: [PATCH] SVE popcount support

On Thu, Feb 06, 2025 at 10:33:35AM -0600, Nathan Bossart wrote:

On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

The unrolled version performs better than the non-unrolled one, but
processing four vectors provides no additional benefit. The numbers
and code used are given below.

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

I've also noticed that the latest patch doesn't compile on my M3 macOS
machine. After a quick glance, I think the problem is that the
TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
versions.

../postgresql/src/port/pg_bitutils.c:230:41: error: invalid output constraint '=q' in asm
230 | __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
| ^
../postgresql/src/port/pg_bitutils.c:247:41: error: invalid output constraint '=q' in asm
247 | __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
| ^
2 errors generated.

--
nathan

#20

Chiranmoy.Bhattacharya@fujitsu.com

over 1 year ago

In reply to: Nathan Bossart (#19)

Re: [PATCH] SVE popcount support

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

Not sure, we tried forcing loop unrolling using the below line in the MakeFile
but the results are the same.

pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native

I've also noticed that the latest patch doesn't compile on my M3 macOS
machine. After a quick glance, I think the problem is that the
TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
versions.

Fixed, we tried using the existing "choose" logic guarded by TRY_POPCNT_FAST.
The latest patch bypasses TRY_POPCNT_FAST by having a separate choose logic
for aarch64.

-Chiranmoy

#21

Nathan Bossart

nathandbossart@gmail.com

over 1 year ago

In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#20)

#22

Chiranmoy.Bhattacharya@fujitsu.com

over 1 year ago

In reply to: Nathan Bossart (#21)

#23

Nathan Bossart