Review/Pull Request: Adding new CRC32C implementation for IBM S390X

Started by Eduard Stefes10 months ago17 messages

Eduard.Stefes@ibm.com

10 months ago

Hi,

Here I send a patch that adds a vectorized version of CRC32C for the
IBM S390X hardware. I kindly ask for a review of the code and to pick
it up in upstream postgresql.

# Why this patch:

We noticed that postgres running on an S390X will spend much longer in
CRC32C as compared to other platform with optimized crc32c. Kindl
Hendrik Brueckner would allow us to re-license his implementation of an
optimized crc32c under postgres-license. This implementation is already
used in gzip, zlib-ng and the linux kernel.

# The optimized CRC32C:

The IBM S390X platform has no dedicated CRC infrastructure. The
algorithm works by using >>reduction constants to fold and process
particular chunks of the input data stream in parallel.<< This makes
grate use of the S390X vector units. Depending on the size of the input
stream a speedup in the order of magnitude can be achieved(compared to
sb8).

# Runtime checks:

The runtime detection strategy follows the same approach as the ARM
code. If the code is compiled with all needed flags enabled the runtime
detection will not be compiled in. If the build system can enable all
needed flags itself, it will also enable runtime detection.

# Slicing by 8:

The standard sb8 code is still always compiled and will be used for
this cases:
- the vector units need to operate on double word boundaries. If the
input stream is not aligned we use sb8 up to the next boundary
- using the vector units for data smaller then 64 byte will neglect the
speed improvement of the algorithm, as register setup and post
processing will eat up all benefits.
- the reduction and folding constants are precalculated for 64 byte
chunks. Adding code for smaller chunks would drastically increase the
complexity.

# The glue code:

I tried to follow the postgres coding conventions. I ran
`./pg_bsd_indent -i4 -l79 -di12 -nfc1 -nlp -sac ...` as mentioned in
src/tools/pg_bsd_indent/README. But for me this will absolutely not
format code according to the postgres coding convention. Therefor I
formatted everything by hand.

I feared that simply writing a function pointer in a software spawning
many threads and forks might cause issues. So i decided to use
`__atomic_store_n` to set the CRC function pointer. Indeed I noticed
that the other _choose.c files did not do this. However I am very
confident that `__atomic_store_n` will always be available on a S390X.

As this is the first time I am writing m4/autotools, I'd kindly ask the
reviewer for special care there :) . There may be dragons. But I have
high hopes all is OK.

Cheers and thanks to all for their work,

--
Eduard Stefes <eduard.stefes@ibm.com>

Aleksander Alekseev

aleksander@timescale.com

10 months ago

In reply to: Eduard Stefes (#1)

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

Hi Eduard,

Here I send a patch that adds a vectorized version of CRC32C for the
IBM S390X hardware. I kindly ask for a review of the code and to pick
it up in upstream postgresql.
[...]
Cheers and thanks to all for their work,

Thanks for submitting this patch.

Please register it on the nearest open commitfest [1]https://commitfest.postgresql.org/53/ so that it
wouldn't be lost.

I didn't review the patch but wanted to point out that when it comes
to performance improvements it's typically useful to provide some
benchmarks.

[1]: https://commitfest.postgresql.org/53/

--
Best regards,
Aleksander Alekseev

John Naylor

john.naylor@enterprisedb.com

10 months ago

In reply to: Eduard Stefes (#1)

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

On Wed, May 7, 2025 at 8:04 PM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:

Hi,

Here I send a patch that adds a vectorized version of CRC32C for the
IBM S390X hardware. I kindly ask for a review of the code and to pick
it up in upstream postgresql.

Thanks!

# Why this patch:

We noticed that postgres running on an S390X will spend much longer in
CRC32C as compared to other platform with optimized crc32c.

Right. That's also true of PowerPC -- perhaps that's a different team
than yours?

Hendrik Brueckner would allow us to re-license his implementation of an
optimized crc32c under postgres-license. This implementation is already
used in gzip, zlib-ng and the linux kernel.

That's good news, although the copyright may need discussion. Please
see the three entries starting here:

https://wiki.postgresql.org/wiki/Developer_FAQ#Do_I_need_to_sign_a_copyright_assignment.3F

...and if you have questions ask us.

# The optimized CRC32C:

The IBM S390X platform has no dedicated CRC infrastructure. The
algorithm works by using >>reduction constants to fold and process
particular chunks of the input data stream in parallel.<< This makes
grate use of the S390X vector units. Depending on the size of the input
stream a speedup in the order of magnitude can be achieved(compared to
sb8).

Okay, we are familiar with this technique as applied to AVX-512 and
Arm Crypto extensions.

# Runtime checks:

The runtime detection strategy follows the same approach as the ARM
code. If the code is compiled with all needed flags enabled the runtime
detection will not be compiled in. If the build system can enable all
needed flags itself, it will also enable runtime detection.

This case is a bit different, since Arm can compute hardware CRC on
any input size. The fast path here is only guaranteed to be taken at
inputs of 79 bytes or bigger, with the 64-byte main loop and 16-byte
alignment. For larger inputs, an indirect call shouldn't matter, so to
keep it simple I'd recommend having only the runtime check. And for
smaller inputs call sb8 directly -- this will avoid the indirect call
which in turn just jumps to sb8. This is important because the server
computes CRC on a 20-byte input for the WAL record header while
holding a lock. Something like:

#define COMP_CRC32C(crc, data, len) \
((crc) = pg_comp_crc32c_dispatch((crc), (data), (len)))

static inline
pg_crc32c
pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
/*
if (len < VX_MIN_LEN + VX_ALIGN_MASK)
{
/*
* For inputs small enough that we may not take the vector path,
* just call sb8 directly.
*/
return pg_comp_crc32c_sb8(crc, data, len);;
}
else
/* Otherwise, use a runtime check for S390_VX instructions. */
return pg_comp_crc32c(crc, data, len);
}

# The glue code:

I tried to follow the postgres coding conventions. I ran
`./pg_bsd_indent -i4 -l79 -di12 -nfc1 -nlp -sac ...` as mentioned in
src/tools/pg_bsd_indent/README. But for me this will absolutely not
format code according to the postgres coding convention. Therefor I
formatted everything by hand.

The entry point for that is a perl script, so you can invoke it on the
relevant directories like so:

src/tools/pgindent/pgindent src/port/ src/include/port

I feared that simply writing a function pointer in a software spawning
many threads and forks might cause issues. So i decided to use
`__atomic_store_n` to set the CRC function pointer. Indeed I noticed
that the other _choose.c files did not do this. However I am very
confident that `__atomic_store_n` will always be available on a S390X.

If this behaves like Linux on other platforms, it'll be copy-on-write
when forking, and only the children will set this variable on first
call. Is there some documented reason we need special treatment here?

As this is the first time I am writing m4/autotools, I'd kindly ask the
reviewer for special care there :) . There may be dragons. But I have
high hopes all is OK.

I haven't taken a close look, but here are a couple things I noticed
at a glance:

+# PGAC_S390X_BAD_VECTOR_INTRINSICS

Does the build still fall back to sb8 with a warning, or does it fail?
It'd simpler if the guard against certain clang versions went into the
choose function, so it can fall back to sb8.

-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
+# all versions of pg_crc32c_armv8.o and pg_crc32c_s390x.o need CFLAGS_CRC
 pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
+pg_crc32c_s390x.o: CFLAGS+=$(CFLAGS_CRC)

It seems we have the same three *.o objects on S390X, for example:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=siskin&dt=2025-05-07%2019%3A08%3A40&stg=build

...so all should have the same additional CFLAGS.

+  prog = '''
+#include <vecintrin.h>
+int main(void) {
+    // test for vector extension
+    unsigned long long a __attribute__((vector_size(16))) = { 0 };
+    unsigned long long b __attribute__((vector_size(16))) = { 0 };
+    unsigned char c __attribute__((vector_size(16))) = { 0 };
+    c = vec_gfmsum_accum_128(a, b, c);
+    return c[0];
+}'''

We prefer that 'a' and 'b' are declared as global variables, just to
make it as realistic as possible, although it doesn't seem to make
much difference when I tried it on Compiler Explorer. (Same for
autoconf)

While playing around with that, the above doesn't compile on gcc 14,
but 13 and 15 work -- maybe a bug report is in order?

Also, if we can assume the existence of the other necessary vector
instructions if the above exists, that would be good to put in a
comment.

--
John Naylor
Amazon Web Services

John Naylor

john.naylor@enterprisedb.com

10 months ago

In reply to: Aleksander Alekseev (#2)

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

On Wed, May 7, 2025 at 8:15 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:

I didn't review the patch but wanted to point out that when it comes
to performance improvements it's typically useful to provide some
benchmarks.

+1 -- It's good to have concrete numbers for the commit message, and
also to verify improvement on short inputs. There is a test harness in
the v7-0002 patch from here:

/messages/by-id/CANWCAZaD5niydBF6q3V_cjApNV05cw-LpxxFtMbwDPLsz-PjbQ@mail.gmail.com

After building, run the "test-crc.sh" script here after executing
"CREATE EXTENSION test_crc32c;":

/messages/by-id/CANWCAZahvhE-+htZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ@mail.gmail.com

--
John Naylor
Amazon Web Services

Tom Lane

tgl@sss.pgh.pa.us

10 months ago

In reply to: John Naylor (#3)

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

John Naylor <johncnaylorls@gmail.com> writes:

We prefer that 'a' and 'b' are declared as global variables, just to
make it as realistic as possible, although it doesn't seem to make
much difference when I tried it on Compiler Explorer. (Same for
autoconf)

Yeah, see commit fdb5dd6331e305f797bb589747f056062c305f0b for
some recent precedent, and for explanation of why we think it's
worth worrying about. It's not so much that "global" matters,
it's that we don't want the compiler to have the option to
fold the whole test to a constant, which it would be within
its rights to do as you have this.

regards, tom lane

Eduard Stefes

Eduard.Stefes@ibm.com

10 months ago

In reply to: John Naylor (#4)

RE: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

Hi,

So I worked on the algorithm to also work on buffers between 16-64
bytes. Then I ran the performance measurement on two
dataset[^raw_data_1] [^raw_data_2]. And created two diagrams
[^attachment].

my findings so far:

- the optimized crc32cvx is faster
- the sb8 performance is heavily depending on alignment (see the
ripples every 8 bytes)
- the 8 byte ripple is also visible in the vx implementation. As it can
only perform on 16 or 64 byte chunks, it will still use sb8 for the
remaining bytes.
- there is no obvious speed regression in the vx algorithm. Except
raw_data_2-28 which I assume is a fluke. I am sharing the system with a
bunch of other devs.

I hope this this is acceptable as performance measurement. However we
will setup a dedicated performance test and try to get precise numbers
without side-effects. But it may take some time until we get to that.

I'll post the update on the Code together with the other requested
updates.

cheers, Eddy

[^raw_data_1]
bytes crc32c_sb8 crc32c_vx
4 6.54 ms 6.548 ms
8 4.476 ms 4.47 ms
10 7.346 ms 7.348 ms
12 10.955 ms 10.958 ms
14 14.548 ms 14.546 ms
16 6.837 ms 6.193 ms
32 12.23 ms 6.741 ms
64 22.826 ms 7.6 ms
80 28.536 ms 8.307 ms
96 34.426 ms 9.09 ms
112 40.295 ms 9.844 ms
128 46.053 ms 10.825 ms
144 51.868 ms 11.712 ms
160 65.91 ms 12.122 ms
176 71.649 ms 13.055 ms
192 77.465 ms 11.716 ms
208 83.286 ms 13.532 ms
224 88.991 ms 13.165 ms
240 94.875 ms 13.881 ms
256 100.653 ms 13.147 ms
8192 2967.477 ms 182.911 ms

[^raw_data_2]
bytes crc32c_sb8 crc32c_vx
4 6.543 ms 6.536 ms
8 4.476 ms 4.47 ms
10 7.35 ms 7.345 ms
12 10.96 ms 10.954 ms
14 14.552 ms 14.588 ms
16 6.843 ms 6.189 ms
18 10.253 ms 9.814 ms
24 9.645 ms 9.924 ms
28 15.957 ms 17.211 ms
32 12.226 ms 6.726 ms
36 18.823 ms 14.484 ms
42 17.855 ms 14.271 ms
48 17.342 ms 7.344 ms
52 24.208 ms 15.306 ms
58 23.525 ms 14.695 ms
64 22.818 ms 7.593 ms

Show quoted text

On Thu, 2025-05-08 at 05:32 +0700, John Naylor wrote:

On Wed, May 7, 2025 at 8:15 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:

I didn't review the patch but wanted to point out that when it
comes
to performance improvements it's typically useful to provide some
benchmarks.

+1 -- It's good to have concrete numbers for the commit message, and
also to verify improvement on short inputs. There is a test harness
in
the v7-0002 patch from here:

/messages/by-id/CANWCAZaD5niydBF6q3V_cjApNV05cw-LpxxFtMbwDPLsz-PjbQ@mail.gmail.com

After building, run the "test-crc.sh" script here after executing
"CREATE EXTENSION test_crc32c;":

/messages/by-id/CANWCAZahvhE-+htZiUyzPiS5e45ukx5877mD-dHr-KSX6LcdjQ@mail.gmail.com

--
John Naylor
Amazon Web Services

John Naylor

john.naylor@enterprisedb.com

10 months ago

In reply to: Eduard Stefes (#6)

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

On Tue, May 27, 2025 at 3:24 AM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:

So I worked on the algorithm to also work on buffers between 16-64
bytes. Then I ran the performance measurement on two
dataset[^raw_data_1] [^raw_data_2]. And created two diagrams
[^attachment].

my findings so far:

- the optimized crc32cvx is faster
- the sb8 performance is heavily depending on alignment (see the
ripples every 8 bytes)

To be precise, these all seem 8-byte aligned at a glance, and the
ripple is due to input length.

- the 8 byte ripple is also visible in the vx implementation. As it can
only perform on 16 or 64 byte chunks, it will still use sb8 for the
remaining bytes.
- there is no obvious speed regression in the vx algorithm. Except
raw_data_2-28 which I assume is a fluke. I am sharing the system with a
bunch of other devs.

I hope this this is acceptable as performance measurement. However we
will setup a dedicated performance test and try to get precise numbers
without side-effects. But it may take some time until we get to that.

This already looks like a solid improvement at 32 bytes and above -- I
don't think we need less noisy numbers. Also for future reference,
please reply in-line. Thanks!

--
John Naylor
Amazon Web Services

Eduard Stefes

Eduard.Stefes@ibm.com

9 months ago

In reply to: John Naylor (#3)

RE: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

On Thu, 2025-05-08 at 05:23 +0700, John Naylor wrote:

Right. That's also true of PowerPC -- perhaps that's a different team
than yours?

indeed that's another team. I can ping them but there is not much more
I can do.

This case is a bit different, since Arm can compute hardware CRC on
any input size. The fast path here is only guaranteed to be taken at
inputs of 79 bytes or bigger, with the 64-byte main loop and 16-byte
alignment. For larger inputs, an indirect call shouldn't matter, so
to
keep it simple I'd recommend having only the runtime check. And for
smaller inputs call sb8 directly -- this will avoid the indirect call
which in turn just jumps to sb8. This is important because the server
computes CRC on a 20-byte input for the WAL record header while
holding a lock.

I worked on the code and got it working on 16 byte chunks so the WAL
writes will directly benefit from this. I will try to calculate and add
the polynomials for the smaller chunks. Maybe we where wrong and we
will still see a speed improvement, also for the smaller pieces.
However that is something that I cannot tackle immediately. I'll come
back to this in the summer.

Something like:

#define COMP_CRC32C(crc, data, len) \
((crc) = pg_comp_crc32c_dispatch((crc), (data), (len)))

static inline
pg_crc32c
pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
/*
if (len < VX_MIN_LEN + VX_ALIGN_MASK)
{
    /*
     * For inputs small enough that we may not take the vector path,
     * just call sb8 directly.
     */
    return pg_comp_crc32c_sb8(crc, data, len);;
}
else
    /* Otherwise, use a runtime check for S390_VX instructions. */
    return pg_comp_crc32c(crc, data, len);
}

Is static inline generally preferred here or should I do something
like:

#define COMP_CRC32C(crc, data, len) \
((crc) = (len) < MAX_S390X_CRC ? \
pg_comp_crc32c_sb8((crc),(data),(len)) : \
pg_comp_crc32c((crc), (data), (len)))

If this behaves like Linux on other platforms, it'll be copy-on-write
when forking, and only the children will set this variable on first
call. Is there some documented reason we need special treatment here?

yes for forks i think there is no problem. But I don't know what will
happen when two cpus will try to read/set the value. The value is
definitely not updated across both CPU caches. So my guess would be
that there is a high chance that the _choose code is called multiple
times(IF there really are multiple threads trying to calculate a crc at
the same time AND the pointer is not already initialized).

Indeed, because it makes no difference if the code is executed multiple
times its totally fine to not use atomic store here. So I'll remove the
code and the checks from the build system again.

Does the build still fall back to sb8 with a warning, or does it
fail?
It'd simpler if the guard against certain clang versions went into
the
choose function, so it can fall back to sb8.

My current implementation will silently fall back to sb8 and not
compile any of the s390x code.

Other possible strategies are:

- print a warning and fall back to sb0
- fail the compile and inform the user to switch compiler or to disable
crc32vx in the build system
- move all the checks into the .c file and to this with with macros

indeed for zlib-ng we went the way to put all the checks into the code
and fail the build if compiled with a known broken compiler.

My arguments to not do the same in postgres are:

- postgres uses sb8 on s390x already so by NOT changing it to another
algorithm we do not change the expected behavior.
- imho, this is something that the build system should check and take
care of, not the code that's getting compiled.

What is the preferred postgres way here?

While playing around with that, the above doesn't compile on gcc 14,
but 13 and 15 work -- maybe a bug report is in order?

Thy for reporting I'll look into that!

Many thanks for all other comments. I'll work on the code to
incorporate all remarks.

--
Eduard Stefes <eduard.stefes@ibm.com>

John Naylor

john.naylor@enterprisedb.com

9 months ago

In reply to: Eduard Stefes (#8)

Re: Review/Pull Request: Adding new CRC32C implementation for IBM S390X

On Sat, May 31, 2025 at 2:41 AM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:

On Thu, 2025-05-08 at 05:23 +0700, John Naylor wrote:

This case is a bit different, since Arm can compute hardware CRC on
any input size. The fast path here is only guaranteed to be taken at
inputs of 79 bytes or bigger, with the 64-byte main loop and 16-byte
alignment. For larger inputs, an indirect call shouldn't matter, so
to
keep it simple I'd recommend having only the runtime check. And for
smaller inputs call sb8 directly -- this will avoid the indirect call
which in turn just jumps to sb8. This is important because the server
computes CRC on a 20-byte input for the WAL record header while
holding a lock.

I worked on the code and got it working on 16 byte chunks so the WAL
writes will directly benefit from this. I will try to calculate and add
the polynomials for the smaller chunks. Maybe we where wrong and we
will still see a speed improvement, also for the smaller pieces.
However that is something that I cannot tackle immediately. I'll come
back to this in the summer.

The measurements from your previous message used 16-byte chunks -- did
that have the necessary constants to get the correct answer?

I'm not sure what you're in doubt about, but I guess it will be clear
with your next patch.

Something like:

#define COMP_CRC32C(crc, data, len) \
((crc) = pg_comp_crc32c_dispatch((crc), (data), (len)))

static inline
pg_crc32c
pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len)
{
/*
if (len < VX_MIN_LEN + VX_ALIGN_MASK)
{
/*
* For inputs small enough that we may not take the vector path,
* just call sb8 directly.
*/
return pg_comp_crc32c_sb8(crc, data, len);;
}
else
/* Otherwise, use a runtime check for S390_VX instructions. */
return pg_comp_crc32c(crc, data, len);
}

Is static inline generally preferred here or should I do something
like:

#define COMP_CRC32C(crc, data, len) \
((crc) = (len) < MAX_S390X_CRC ? \
pg_comp_crc32c_sb8((crc),(data),(len)) : \
pg_comp_crc32c((crc), (data), (len)))

That macro is actually better -- the static inline was used for x86
because it has looping which is not the case for S390X.

Does the build still fall back to sb8 with a warning, or does it
fail?
It'd simpler if the guard against certain clang versions went into
the
choose function, so it can fall back to sb8.

My current implementation will silently fall back to sb8 and not
compile any of the s390x code.

That seems like a good design.

Other possible strategies are:

- print a warning and fall back to sb0
- fail the compile and inform the user to switch compiler or to disable
crc32vx in the build system
- move all the checks into the .c file and to this with with macros

indeed for zlib-ng we went the way to put all the checks into the code
and fail the build if compiled with a known broken compiler.

My arguments to not do the same in postgres are:

- postgres uses sb8 on s390x already so by NOT changing it to another
algorithm we do not change the expected behavior.
- imho, this is something that the build system should check and take
care of, not the code that's getting compiled.

I'm inclined to agree, but I wonder if these guards should go inside
the link check. I.e. the separate config machinery for "have bad
compiler" seems like unnecessary clutter.

--
John Naylor
Amazon Web Services

#10

Eduard Stefes

Eduard.Stefes@ibm.com

9 months ago

In reply to: John Naylor (#9)

[V2] Adding new CRC32C implementation for IBM S390X

Hi,

So here is V2 of the crc32c_s390x patch. Changes from V1 are:

- added gcc 14-14.2 as known broken compiler (bug was fixed with 14.3)
- moved broken compiler check to vx extension compile&link check
- made variables global in the extension check
- create dependency to getauxval in configure, so we don't compile the
code if we won't be able to detect the cpu extension at runtime
- moved buffer length check into macro
- changed minimal buffer length for crc32c_s390x from 64 to 16 byte
- added CFLAGS_CRC to all crc32c_s390x* artifacts
- fixed formatting with pgindent
- fixed typos in email address

--
Eduard Stefes <eduard.stefes@ibm.com>

#11

John Naylor

john.naylor@enterprisedb.com

9 months ago

In reply to: Eduard Stefes (#10)

Re: [V2] Adding new CRC32C implementation for IBM S390X

On Thu, Jun 5, 2025 at 2:15 PM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:

Hi,

So here is V2 of the crc32c_s390x patch. Changes from V1 are:

Hi Eduard, thanks for the update. Note, it's not necessary to change
the thread subject, and in fact I came very close to missing this
email entirely.

Secondly, I haven't seen a response about the copyright. I'm referring
to this part in particular:

I shared this link in my first reply:

https://wiki.postgresql.org/wiki/Developer_FAQ#Do_I_need_to_sign_a_copyright_assignment.3F

It says, in part:

"May I add my own copyright notice where appropriate?

No, please don't. We like to keep the legal information short and
crisp. Additionally, we've heard that could possibly pose problems for
corporate users."

- added gcc 14-14.2 as known broken compiler (bug was fixed with 14.3)

Why do we need to mark this as broken? In my research, it caused
compiler errors with those versions -- that itself should cause
fallback to sb8.

On that note, I'm now wondering if clang compiles and links
successfully but the program is broken? Or does it fail to compile? If
the latter, we should treat them the same: there is no need for "known
broken versions" in the configure checks, as it's just a matter of
documentation.

- create dependency to getauxval in configure, so we don't compile the
code if we won't be able to detect the cpu extension at runtime

That's just unnecessary clutter, and we don't do that anywhere else.
We already have the HAVE_GETAUXVAL symbol to guard the runtime check.

As I alluded to before, I'm not in favor of having both direct-call
and runtime-check paths here. The reason x86 and Arm have it is
because their hardware support works on any length input. Is there any
reason to expect auxv.h to be unavailable?

Also, lately we have been moving away from having separate *choose.c
files, since they add complexity to the build system. I'd suggest
looking at src/port/pg_popcount_aarch64.c -- the file is compiled
unconditionally but inside it has the appropriate #ifdef's as well as
the choice function. See how it handles auxv.h as well.

- moved buffer length check into macro
- changed minimal buffer length for crc32c_s390x from 64 to 16 byte

+#define COMP_CRC32C(crc, data, len) \
+ ((crc) = (len) < 16 ? pg_comp_crc32c_sb8((crc),(data),(len)) :
pg_comp_crc32c((crc), (data), (len)))

Your tests demonstrated improvement with 32 bytes and above, and
nothing less than 31 makes sense as a minimum because of the 16-byte
alignment requirement. I've mentioned that we don't want the 20-byte
WAL header computation to have any more overhead than it does now.

--
John Naylor
Amazon Web Services

#12

Eduard Stefes

Eduard.Stefes@ibm.com

8 months ago

In reply to: John Naylor (#11)

RE: [V2] Adding new CRC32C implementation for IBM S390X

Hi,

On Wed, 2025-06-11 at 13:48 +0700, John Naylor wrote:

Hi Eduard, thanks for the update. Note, it's not necessary to change
the thread subject, and in fact I came very close to missing this
email entirely.

Sorry for the confusion.

Secondly, I haven't seen a response about the copyright. I'm
referring
to this part in particular:

+ * Copyright (c) 2025, International Business Machines (IBM)

I shared this link in my first reply:

It says, in part:

"May I add my own copyright notice where appropriate?

No, please don't. We like to keep the legal information short and
crisp. Additionally, we've heard that could possibly pose problems
for
corporate users."

I had to talk to some people about this and get their ok. I'll remove
the copyright text.

- added gcc 14-14.2 as known broken compiler (bug was fixed with
14.3)

Why do we need to mark this as broken? In my research, it caused
compiler errors with those versions -- that itself should cause
fallback to sb8.

On that note, I'm now wondering if clang compiles and links
successfully but the program is broken? Or does it fail to compile?
If
the latter, we should treat them the same: there is no need for
"known
broken versions" in the configure checks, as it's just a matter of
documentation.

Yes in all cases the compilation will fail and we will fall back to
sb8. However personally I think the user should get feedback of *why*
something fails. I'll remove the check and document the broken compiler
versions.

- create dependency to getauxval in configure, so we don't compile
the
code if we won't be able to detect the cpu extension at runtime

That's just unnecessary clutter, and we don't do that anywhere else.
We already have the HAVE_GETAUXVAL symbol to guard the runtime check.

Noted. I'll remove it.

As I alluded to before, I'm not in favor of having both direct-call
and runtime-check paths here. The reason x86 and Arm have it is
because their hardware support works on any length input. Is there
any
reason to expect auxv.h to be unavailable?

I tried to find a reason but did not find any. So I'll remove it.

Also, lately we have been moving away from having separate *choose.c
files, since they add complexity to the build system. I'd suggest
looking at src/port/pg_popcount_aarch64.c -- the file is compiled
unconditionally but inside it has the appropriate #ifdef's as well as
the choice function. See how it handles auxv.h as well.

I'll change that.

Your tests demonstrated improvement with 32 bytes and above, and
nothing less than 31 makes sense as a minimum because of the 16-byte
alignment requirement. I've mentioned that we don't want the 20-byte
WAL header computation to have any more overhead than it does now.

Sorry, yes that's true I'll change that back.

--
Eduard Stefes <eduard.stefes@ibm.com>

#13

Eduard Stefes

Eduard.Stefes@ibm.com

8 months ago

In reply to: John Naylor (#11)

Re: [V2] Adding new CRC32C implementation for IBM S390X

Hi,

here is V3 of the patch. Changes from V2:

- removed IBM copyright
- removed GETAUXVAL check in favor of the already provided check
- moved runtime selection code from pg_crc32c_s390x_choose.c to
pg_crc32c_s390x.c and removed _choose.c file
- removed the config time compiler check and let the buildsystem fall
back to sb8
- changed buffer limit back to 32 bytes before calling s390x specific
implementation

--
Eduard Stefes <eduard.stefes@ibm.com>

#14

John Naylor

john.naylor@enterprisedb.com

8 months ago

In reply to: Eduard Stefes (#13)

Re: [V2] Adding new CRC32C implementation for IBM S390X

On Tue, Jul 8, 2025 at 6:46 PM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:

Hi,

here is V3 of the patch. Changes from V2:

- removed IBM copyright
- removed GETAUXVAL check in favor of the already provided check
- moved runtime selection code from pg_crc32c_s390x_choose.c to
pg_crc32c_s390x.c and removed _choose.c file
- removed the config time compiler check and let the buildsystem fall
back to sb8
- changed buffer limit back to 32 bytes before calling s390x specific
implementation

Hi Eduard, I look a brief look at v3 and it seems mostly okay at a
glance. There is just one major thing that got left out:

On Wed, Jul 2, 2025 at 3:27 PM Eduard Stefes <Eduard.Stefes@ibm.com> wrote:

On Wed, 2025-06-11 at 13:48 +0700, John Naylor wrote:

As I alluded to before, I'm not in favor of having both direct-call
and runtime-check paths here. The reason x86 and Arm have it is
because their hardware support works on any length input. Is there
any
reason to expect auxv.h to be unavailable?

I tried to find a reason but did not find any. So I'll remove it.

v3 still has direct-call and runtime-check paths. Let's keep only
USE_S390X_CRC32C_WITH_RUNTIME_CHECK and discard the direct call
configure checks. Once that's done I'll take a closer look and test as
well. The rest should be small details.

--
John Naylor
Amazon Web Services

#15

Eduard Stefes

eddy@linux.ibm.com

8 months ago

In reply to: John Naylor (#14)

Re: [V2] Adding new CRC32C implementation for IBM S390X

Hi,

On Wed, 2025-07-09 at 13:53 +0700, John Naylor wrote:

v3 still has direct-call and runtime-check paths. Let's keep only
USE_S390X_CRC32C_WITH_RUNTIME_CHECK and discard the direct call
configure checks. Once that's done I'll take a closer look and test
as
well. The rest should be small details.

done :) here is V5 with only runtime-check.

--
Eduard Stefes <eddy@linux.ibm.com>
IBM

#16

John Naylor

john.naylor@enterprisedb.com

8 months ago

In reply to: Eduard Stefes (#15)

Re: [V2] Adding new CRC32C implementation for IBM S390X

On Fri, Jul 11, 2025 at 7:01 PM Eduard Stefes <eddy@linux.ibm.com> wrote:

On Wed, 2025-07-09 at 13:53 +0700, John Naylor wrote:

v3 still has direct-call and runtime-check paths. Let's keep only
USE_S390X_CRC32C_WITH_RUNTIME_CHECK and discard the direct call
configure checks. Once that's done I'll take a closer look and test
as
well. The rest should be small details.

done :) here is V5 with only runtime-check.

Great, thanks! I took this (v4, for the archives) for a spin on one of
the LinuxONE instances in the buildfarm. Since godbolt.org doesn't
have clang for s390x, I tested on a machine with clang 18. With that,
I found that the configure check successfully compiles and links
broken code. :-( This is different from gcc 14 which "harmlessly"
failed to compile (until 14.3 fixed that). So, it seems for the clang
versions that are broken we actually do need to guard within the test
programs after all. Eduard, which compiler versions have you tested
the patch on?

I found the patch passes tests with gcc 13.3 on this machine, then
looked at the configuration outputs.

configure:

```
checking for __attribute__((vector_size(16))), vec_gfmsum_accum_128
with CFLAGS=-fzvector -march=z13... no
checking for __attribute__((vector_size(16))), vec_gfmsum_accum_128
with CFLAGS=-mzarch -march=z13... yes
checking which CRC-32C implementation to use... S390X Vector
instructions for CRC with runtime check
checking for vectorized CRC-32C... none
```

This looks strange. I think we want "CRC-32C implementation" to report
"slicing-by-8" and "vectorized CRC-32C" to report "S390X Vector
instructions".

+# Check for S390X Vector intrinsics to do CRC calculations.
+#
+# First check for the host cpu
+if test x"$host_cpu" = x"s390x" && test x"$ac_cv_func_getauxval" = x"yes"; then

I believe ac_cv_func_getauxval is a leftover from the previous patch.

meson:

```
Checking for function "getauxval" : YES
Checking if "s390x_vector_vx+march=z13" : links: YES
```

This looks fine except for the "getauxval", which is due to the duplicate check:

+elif host_cpu == 's390x' and cc.has_function('getauxval')

This check already happened at the beginning of the file within
"func_checks" and defined the corresponding HAVE_ symbol, which v4
seems to use appropriately in the runtime check.

--
John Naylor
Amazon Web Services

#17

Alvaro Herrera

alvherre@2ndquadrant.com

about 1 month ago

In reply to: John Naylor (#16)

Re: [V2] Adding new CRC32C implementation for IBM S390X

On 2025-Jul-17, John Naylor wrote:

Great, thanks! I took this (v4, for the archives) for a spin on one of
the LinuxONE instances in the buildfarm. Since godbolt.org doesn't
have clang for s390x, I tested on a machine with clang 18. With that,
I found that the configure check successfully compiles and links
broken code. :-( This is different from gcc 14 which "harmlessly"
failed to compile (until 14.3 fixed that). So, it seems for the clang
versions that are broken we actually do need to guard within the test
programs after all. Eduard, which compiler versions have you tested
the patch on?

Did anybody want to update this patch? Seems sad to have progressed
this much and abandon it at the last minute.
https://commitfest.postgresql.org/patch/5779/

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)

Review/Pull Request: Adding new CRC32C implementation for IBM S390X

Attachments:

Attachments:

Attachments:

Attachments:

Attachments: