[PATCH] SVE popcount support

Started by Malladi, Rama about 1 year ago, 35 messages
#1Malladi, Rama
rvmallad@amazon.com

Attachments protected by Amazon:

[0001-SVE-popcount-support.patch]
https://us-west-2.secure-attach.amazon.com/a29c9ff9-1f9b-430f-9b3c-07fde9a419aa/f9178627-0600-4527-bc5c-7e4cb9ef6e9a
[SVE-popcount-support-PostgreSQL.png]
https://us-west-2.secure-attach.amazon.com/a29c9ff9-1f9b-430f-9b3c-07fde9a419aa/13c252c4-c45e-447c-9e55-fe637f8d345c

Amazon has replaced the attachments in this email with download links. Downloads will be available until December 27, 2024, 15:43 (UTC+00:00).

Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.

[0]: /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

#2Kirill Reshke
reshkekirill@gmail.com
In reply to: Malladi, Rama (#1)
Re: [PATCH] SVE popcount support

On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:

Please find attached a patch to PostgreSQL implementing SVE popcount. I used John Naylor's test_popcount module [0] to put together the attached graphs. This test didn't show any regressions with a relatively small number of bytes, and it showed the expected improvements with many bytes.

[0] /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Hi! To register an entry in the commitfest, you need to send the patch in
one of these formats:

https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F

This is useful for reviewers who use cfbot or cputube.

--
Best regards,
Kirill Reshke

#3Kirill Reshke
reshkekirill@gmail.com
In reply to: Malladi, Rama (#1)
Re: [PATCH] SVE popcount support

On Thu, 28 Nov 2024 at 20:22, Malladi, Rama <rvmallad@amazon.com> wrote:

Please find attached a patch to PostgreSQL implementing SVE popcount. I
used John Naylor's test_popcount module [0] to put together the attached
graphs. This test didn't show any regressions with a relatively small
number of bytes, and it showed the expected improvements with many bytes.

[0] /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Hi!
I looked through this patch. It is implemented mostly the same way as the
current instruction-selection code, which is good.

=== patch itself

1)

// for small buffer sizes (<= 128-bytes), execute 1-byte SVE instructions
// for larger buffer sizes (> 128-bytes), execute 1-byte + 8-byte SVE instructions
// loop unroll by 2

PostgreSQL uses /* */ comment style.

2)

+ if (bytes <= 128)
+ {
+ prologue_loop_bytes = bytes;
+ }
+ else
+ {
+ aligned_buf   = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) + sizeof(uint64_t);
+ prologue_loop_bytes   = aligned_buf - buf;
+ }

For a single-statement block, PostgreSQL does not use braces. Examples: [0] & [1].

[0]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/intarray/_int_bool.c;h=2b2c3f4029ec5cb887bdc6b01439b15271483bbf;hb=HEAD#l179
[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/pl/plpgsql/src/pl_handler.c;h=b18a3d0b97b111e55591df787143d015e7f1fdc5;hb=HEAD#l68
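To illustrate points 1) and 2), the quoted hunk might look roughly like this in PostgreSQL style. This is only a standalone sketch, not the patch: `TYPEALIGN_DOWN` is redefined here as an assumption modeled on the macro in PostgreSQL's c.h, and `prologue_bytes` is a hypothetical helper name that does not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed stand-in for PostgreSQL's TYPEALIGN_DOWN: round LEN down to a
 * multiple of ALIGNVAL, which must be a power of two.
 */
#define TYPEALIGN_DOWN(ALIGNVAL, LEN) \
	(((uintptr_t) (LEN)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

/*
 * Hypothetical helper: number of bytes the byte-wise prologue loop handles
 * before an 8-byte loop could start.
 */
size_t
prologue_bytes(const char *buf, size_t bytes)
{
	const char *aligned_buf;

	assert(buf != NULL);

	/* Small inputs are handled entirely by the byte-wise loop. */
	if (bytes <= 128)
		return bytes;

	/* Next 8-byte boundary strictly after buf, as in the quoted hunk. */
	aligned_buf = (const char *) (TYPEALIGN_DOWN(sizeof(uint64_t), buf) +
								  sizeof(uint64_t));
	return (size_t) (aligned_buf - buf);
}
```

Note that, as in the quoted hunk, an already aligned buffer still gets a prologue of 8 bytes rather than 0.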

3) The `if (bytes > 128)` loop in the pg_popcount_sve function should be
commented. There is too much code without any comment explaining why it works.
For example, if the original source of this is a paper or other work, we can
reference it.

==== by-hand benching (I also use John Naylor's module)

non-patched

```
db1=# \timing
Timing is on.
db1=# select drive_popcount(10000000, 10000);
drive_popcount
----------------
64608
(1 row)

Time: 8886.493 ms (00:08.886) -- with small variance (+- 100ms)

db1=# select drive_popcount64(10000000, 10000);
drive_popcount64
------------------
64608
(1 row)

Time: 139501.555 ms (02:19.502) with small variance (+- 1-2sec)
```

patched

```
db1=# select drive_popcount(10000000, 10000);
drive_popcount
----------------
64608
(1 row)

Time: 8803.855 ms (00:08.804) -- with small variance
db1=# select drive_popcount64(10000000, 10000);
drive_popcount64
------------------
64608
(1 row)

Time: 200716.879 ms (02:21.717) -- with small variance
```

I'm not sure how to interpret these results. Looks like this does not help
much on a large $num?
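One hedged way to read the `drive_popcount` pair above is to express the difference as a percentage of the unpatched time and compare it with the stated variance. The helper below is a hypothetical sketch for that arithmetic only; `speedup_percent` is not part of the patch or of the test module.

```c
#include <assert.h>

/*
 * Percentage by which the patched run is faster than the baseline run
 * (negative if the patched run is slower).
 */
double
speedup_percent(double baseline_ms, double patched_ms)
{
	assert(baseline_ms > 0.0);
	return (baseline_ms - patched_ms) / baseline_ms * 100.0;
}
```

For the timings quoted above, `speedup_percent(8886.493, 8803.855)` is about 0.93, while the reported variance of +- 100 ms on 8886 ms is about 1.1%, so the `drive_popcount` difference sits inside the measurement noise.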

--
Best regards,
Kirill Reshke

#4Bruce Momjian
bruce@momjian.us
In reply to: Malladi, Rama (#1)
Re: [PATCH] SVE popcount support

On Wed, Nov 27, 2024 at 03:43:27PM +0000, Malladi, Rama wrote:

Please find attached a patch to PostgreSQL implementing SVE popcount. I used
John Naylor's test_popcount module [0] to put together the attached graphs.
This test didn't show any regressions with a relatively small number of bytes,
and it showed the expected improvements with many bytes.

You must attach actual attachments for this to be considered. Download
links are unacceptable.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#5Malladi, Rama
ramamalladi@hotmail.com
In reply to: Kirill Reshke (#3)
3 attachment(s)
Re: [PATCH] SVE popcount support

Thank you, Kirill, for the review and the feedback. Please find inline
my reply and an updated patch.

On 11/29/24 2:37 AM, Kirill Reshke wrote:

PostgreSQL uses /*  */ comment style.

Fixed in the attached patch.

2)
For a single-statement block, PostgreSQL does not use braces. Examples
[0] & [1]

Fixed in the attached patch.

3) `if (bytes > 128)` Loop in pg_popcount_sve function should be
commented. There is too much code without any comment why it works.
For example, If original source of this is some paper or other work,
we can reference it.

From experimentation we found that for smaller buffer sizes, the
overhead of computing the prologue, kernel and epilogue loop parameters is
high. So, for the `< 128B` buffer size case, we use the SVE `8-bit` loop, and
for larger buffer sizes (`>= 128`), we use the `64-bit` SVE implementation.
Attached is an SVE popcount implementation comparison.
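The size-based split described above can be sketched in portable C. This is an assumption-laden illustration, not the SVE code from the patch: it uses no SVE intrinsics, the names `sketch_popcount` and `sketch_popcount_word` are hypothetical, and the `>= 128` threshold follows the wording of this reply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable SWAR population count of one 64-bit word. */
uint64_t
sketch_popcount_word(uint64_t x)
{
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	return (x * UINT64_C(0x0101010101010101)) >> 56;
}

/* Count set bits in buf, choosing the loop shape by buffer size. */
uint64_t
sketch_popcount(const char *buf, size_t bytes)
{
	uint64_t	total = 0;

	assert(buf != NULL || bytes == 0);

	if (bytes >= 128)
	{
		/* Prologue: single bytes until buf is 8-byte aligned. */
		while (((uintptr_t) buf & 7) != 0)
		{
			total += sketch_popcount_word((unsigned char) *buf++);
			bytes--;
		}

		/* Kernel: one aligned 64-bit word per iteration. */
		while (bytes >= 8)
		{
			uint64_t	word;

			memcpy(&word, buf, sizeof(word));
			total += sketch_popcount_word(word);
			buf += sizeof(word);
			bytes -= sizeof(word);
		}
	}

	/* Small buffers, and the epilogue of large ones: one byte at a time. */
	while (bytes > 0)
	{
		total += sketch_popcount_word((unsigned char) *buf++);
		bytes--;
	}

	return total;
}
```

Small buffers skip the alignment bookkeeping entirely, which is the overhead the paragraph above refers to; larger buffers pay it once and then process eight bytes per iteration.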

==== by-hand benching (I also use  John Naylor's module)

non-patched
```
db1=# select drive_popcount(10000000, 10000);
Time: 8886.493 ms (00:08.886) -- with small variance (+- 100ms)
db1=# select drive_popcount64(10000000, 10000);
Time: 139501.555 ms (02:19.502) with small variance (+- 1-2sec)
```

patched
```
db1=# select drive_popcount(10000000, 10000);
Time: 8803.855 ms (00:08.804) -- with small variance
db1=# select drive_popcount64(10000000, 10000);
Time: 200716.879 ms (02:21.717) -- with small variance
```

I'm not sure how to interpret these results. Looks like this does not
help much on a large $num?

Can you clarify which system and architecture you tested this patch on?
Note that the patch has optimizations only for `popcount`, not `popcount64`.

Attachments:

SVE-popcount-implementation-comparison.png (image/png)
SVE-popcount-support-PostgreSQL.png (image/png)
fx��v�mU�S��Z=��7������|])�=�XW>�g@@�I�7�lw<�R������y���d�
7lt����H�5��
@@hM]M�r�����
���}����W����t}��S�f���c������?���J�}��v�e�����oW_}u����?�������N�=��3�����d����������y�U�������Y=j|@@�^f���9@@h�ko�k��1/�>�[�H�%i����d~��[`�r	��_=�Wl�M6��}�{�5�1c��s�9��>���f�O��W����k�&M�C9�v�i'{��������J�]~���W=������O��]�@@h����#sth���m��:�V�'���fE�r�@@�^��,��"i%j{�A�����b�].a��Q�������^j+��rXA
�x�-��!/��B[{��m����������L{������/�~���+��z�U������Z�*������k��f�z�._�Q@@(��n�M�����w��Z^H�_$����h# � ��(�
�R�)����u���|���{j�=��'�x��������t�M���o��{��o����������o�����z���>e;����������6n�8�8q�M�6-��L�1{��Gm��)��6���GYZ�	&T�����������>hg�q���	y?�w��@@hV
�����G��	/ � � �@+
��]���Y�����d~:
��SI�n�!����G�l�{��w��+��[�hQ�E6p�@;��Cm��W���%�\b��O��VZi%�j���p�}��)�_m�j������:u��5���������yD@@����c�]{���8����Y����$���([�v"� � ��$�}���}D1��^�
��A�����4�r�����J�-������j/��������;����j�U{_u������
+t�>������u��' � �4�@*a����m�a���������H���cN@@�N��I�m���7u.��}�����i~��'x@@�y����#s:��$�z~;�D@@�%�r����C�,�R�5�$���?�� � � �@�
��m]�a�S�=�zv�bi � �J���YQo����i~�7���@@@��@*aV�0�0�nu�#� � �@I 5N}�M7A�2�Q����t
U0+ � � ���)�kcF������Y��FT� � �@�RWf��dYV$�s����50' � � ���)�|�2i��/f��Q; � ��HEZ]I��2f���~X�M���W����t�5�	@@@�{R�!	����V@@h2�T@�&=(j2�W5�0k��C{@@@@#'N�G�,��1z�`+������f61O@@@@��b,�M��q�� �O7c[i � ���@*>,zl���H�����Z#� � PV��eei����t�D� � � �@3���S&�����L���5k�_�A����ux���4 � ��������1I�8.3b���gi � � P] #���cFlV��M<���b3�!a)xD@@��PR)�����e}�F��Y��B@@�}R	�V�������eH�(�# � �m.�
�D�
�P�n��i�?����] � � �^#'N�G�,����#��/f63O@@hO�e���}���{�5,@@�(��d������~g,��T�E��`��"� � �� H�a(���^Z�����9�B@@zO +�B����H���~��@@h
z���f�A�����d@@@`���9����3:p�������kE|���H�qK�f@@$@��A�uV��4?]gu|@@h�@*^l���	(�0k��CE � �O 5������(���A��.B�i# � ���	�e�	[x��E���Y;@@�\ ���U�,���I2?]���F@@hm����#stX��#�&�ux��OR�=���%i3 � �] Y�E�}�i~�A�S
 � � P�@jD�V�$�0�{W�� � �W �hmZ%�)���A��.J�i' � �����9����3:�\�������Y��� � ������g�� �O7Oi	 � ���@*n8����YKp��/f-�iY	@@�	��}�����k�\>H���^�!� � ���b�V�S�	�Z��E@@����\�i~��ZIk@@@��FN�n��Y�a�I�u��	 � �U u��������m�$��E[�� � ��%���r������~-�����f-�iY	@@*��a3���n�����tw/��@@@�J�����;V�H��K�_$�
�	i, � ��	���PM�t�`}2��)����mKG@@�vH]p9p@_3b��aI�_$�Zf��" � ���1���� �O7C�h � ���@*�l��IR�	����Yk@@�6H]�Uo����9}�����>�@@�/@���mH�V���5@@@ 
0c�h�G���l-�B@@�vH���6�*���Y;���# � �@[
��
D�
�Q����4?]�u�� � � Pl�����i&����kE~!�0+��� � �TH9�X���A������H@@@��
�O���@@
)P�wY�
�Q�����d~���$ � � �c�X�/�L�_�0����!� � �3�GKf(����e)>H������ � � �@�R�d+������Y��"�A@@�IRC1�i�6�F�pw�>H�����;���?�����o��S�8�"�)%2)��)�����b�(K�!!pd�Wl���
f���H�B)������9g��t����;����U���{�Y>�g*���{n��d@@@���s
eoQ���$��^�a@@�/`�P�:�_��#4�df9���]���D�}�]IHH���z*�d8 6l�u�~�ii��������u�:tHZ�j%�����W�Jm���x�����p���� � ������x\����Ym���@@�(
�,�"n��6�4��������n�T���������u�k����u��%;v�;vL��?��h�B���/3f����
III����Kyy�4H�m/T�P��
��=/�z~��� � �@�	�f��Z�-�"aVg?;:F@@ ���F���Md���53H3�u7����j�*Q�z��-{���O�K��������K��o/�}��xf�����G���)S�s��r��Y����S�NI^^�4i�DO$T��S�����J�v���S��}�@@���E%�R�N���R�dZ��>��a��0��+�@@\/��e��0�4���	=���Bi����~�����/�'�%�T�K����;�y���~�z�0;s���3F���>|�����_�~�m1b����G'�B����
��������n���z|@@p��-�����m�	3�� � ���5��x
l������ �,;r�T���o�!*���k��J�yf����9s����C�o���Q���[�O��]�q��m!��X�"d;���iW-n=���F@@��l�e�������Y]���@@�	�c� ��3H3�u0�Z�2X���/����l��o~#?��,Y���0S�(�����K�^���...�Tb���e��!�~�z�����������n���z|@@p����B�[T�3��������/6l�	�x���@@���;F��o���r���6�@	���/���{�t����9St�����g�}&��/�^xAz���m����2n�8����<��3������������p�
�U�&������t�E@@�G`L�Q�m����������]�F�K[�u��-_5{JOO��W���6�PF@@�,s�E��<A���L��J�y������;��C��/�����������$??_��L����9x��������c���6m
Yo��U!������]5�p�y���7	���q � ��M���
�{�d�C��0S���C��#�7 � �1&�R�1v�,�%aV!s�������(��?���������d�����6h� ��������a���}��';v��Y�f��h����U;��p�
�[����� � �@m�n�LKM�i�w�F�����y:gIF�� � ����-�Q�g)������	��
f9�fQ��z�L=5u���R0*��W_������-��"S�L���������^�zzj��]�v��y�$99Y��9#�=�\�z�F�
z\��>/^���Ri��MX�����}�	�A@@�l1f<������Y��@@����N<3�q��s� �,����3�o��V���k}�����{��x�	�}��wK��M��.Y�D<�0k�����3G�m���a��c��i7n���{���#�m��������g����e�������y�����7 � �u%0>�P���t�1�-�"a�s��@@@��,���k�� �,�{~��SI��+WZ����-]�t��$�����y��RQQ!������B��Wbb����S/�����m#T�P�=
��������lKII�X����F@@��lqfvf�tK���u=�H�o��H�EZ��@@��O�E��6�4�\Cqt�W�\�O�5l��g��.]���OK����K3�T���^����.H�����} � ����-a���{ks���-�"aV����@@������9�L3H3�N/�B@@���]T"/����XZj�L���g_<m��/f�t�� � �@\���S�������^�����x�7�C@@��n�����)y[�E��9�IF� � �@@[�*�{$N�A�Y���1
@@�[����-�"a?V�� � �n��J$�7\u3H3�n�;sD@@g��)��E�>�����n��>��i��0��+�\@@�R�����c�_n3H3��?3f� � ��"`[�?��M[�E�,V~��@@����1O�������r|��Y � � ����hB����#Hs����!� ��E P����$�2�tC\�����r\O��!� � ��cl7i�!���_<a���%A@@�W �R������B|o�A�Y��Y3;@@p��-a��Ml�	3��* � ���-hQ���q_4�4��g� � � ����n�;m�	3G�$ � ����.�aY_���2�4��3g� � � �[���UMl�	3'�" � ��@������(�A�Yv���$ � ����-a���{�|\��-�"amu�G@@�
��enX�
LqS���r�L�� � � ��cv��K9;}����Ml�	3�� � ������>5"7��W��u������
�"� � ���l7l��fM[�E��M�~�� � �h[�������'J�3�4���h@@�
�bP����������A@@���*j4n	V�N�n{6�4�\���w@@p����B�[T�3U��;��0��)�� � �@��c���u�f�f��z\�� � ��/`�C��:[������?�(�7o�/��RT�m����H�v�B�/���#�u�V9t���j�J����W�^��B�V�� � ��u�.s�O����{E�9 � ������y)gg��H�$n�&L� EEE���$W�\���2i���L�:U:v�hh����/3f����
III����Kyy�4H(����� � �"$���+0�/��nf� � �����i�I2-��hw���m��+�0[�x��u�]����f,��>�L���'�<�L��5~�x9z��L�2E:w�,g�����,9u�������[��M�4	�6@@�)`[CI���>w^��fmif���@@@���fnz��-�re������f����_���2z�h��z[%���#��s�>�[�������~[F�!���A�����{@@l�RqS���_���e��0@@�����B�[T����bQ[�E����A��L=!�	�m����9sd�����o_��h���2}�t�,��+�W�6�A@@@	�,�w�� �,�� � �DS���Ivf�tKM�f��i��>av��Uy��W������mv�m�Y/����e��%����K�^��u���E-���Y�n]��C���G@@����D����~f�f����l@@@��l1���Q[�����Jr-]�T'�T2,���>�����/� =z��V;|���7N?���j��{?���0
����U@@�*���rY������uo.�woVi?;�!��k�����r�;�A@@@�����y)g��GZj�L���g_<o��/W'�v��%3g���m���I��i�����M�$??_��L����QO�������c���U��Wu"�!a	E�@@�F�dY��G�Wf�P�
@@�k�+���2�O���8p@�M�&���2y�di���q�rq��2k�,Q�!4h��BAA������a�d��EA��w�}��( � ��S�����p������Y�A�Y��f/ � � Psf$���"����)StQ���S�N�cf����RZZ��ig�����{N��o/��O�z�����	5����y�d��QA�''��ey�!e@@��lA�:��;�~�d&��22 � � �@������R���33�[�{r����={�\5U�������>��*1]�p����w����_����K�������4����e���2q�DQ����3G�m���a��c��W_����w��2r����}:b@@�U$�\u����'H��I��ba�PX.=^IDAT@@�*`[��m������u	3��b�/_��|��1���YNN�|����i����������Kaa�����r�={���1&$$�<�� � ���$
�mA��.z�5�4fa�Q@@j$���D^��Y�
���f,������d>��y*����w��z"�a��>��t���>}ZZ�n�]�����Y�2 � �@��tY�_�������ru��<@@@ ��->MKM�i�w;-����/fqw�� � ���l���-s�U����A�Y���3 � ����-Fuc|j��H���/��!� �����e�jPn[���/�C`if���eX � � �����?/*	��q3|@@�
���S�uc0���T�#3�df�nGE� � � ���kggfH���x��u^�������� � ��L �����6|�$��l3H3��=kf� � �u%`K��q[�E���~��� � ��e.���x�^\_�N���r
��t@@@�����N���i��H�U���@@��	�c���t���e70W@@�[���W��/f��;�@@����SSw�]{.��������j7#'�������+			��SOU��?�(�7o�/��RT�m����H�v��u�9"[�n�C�I�V�$==]z���=�)�������p����l#� � if?���/f?�PB@@���D5�R�5���� �,��d�M�����a�Y�|��?^'����[i�&L���"IJJ�+W�HYY�4h�@�N�*;v������3���BRRR����R^^.�
��z�U/�qoC~�p���_�l"� � ������/fQ���( � ��%�����U�������+�W�Z%�_���e��=��aC�%�/^,w�u�����������>�~���3�<#�����G���)S�s��r��Y����S�NI^^�4i�D����p������p����o�m@@�!����X��wg�[i�v��/fn���@@�Vl�����H���x'f�f�c|Z�_XX(�5��o�]^|�E���-a���Z�1;;[~��_����e��1r�=������U��}y���e����O�U/777d;��T�-T������A@@�6�>����-�"aV�D�@@�{�.��K�	�A�Y�Jgk�*	3�>3���z����n�9s����C�o���Y���[�O��]�q��m!��X�"d;���iW-n=���F@@ ��������_$�����M@@W	��R���Tk�f�f���X��n�L����W^���z���#Gd��%����K�^���...�K5���_�"���Yo��u!��vp�N�����������@@�p>�^.����T���������>'nt��5����_$�"�L� � �n`)F�]������r�z�^K�]`��o����^�?���}j�7�x��n�Zn��f���[%##C��Q�	3��Z�t�N��$�| ��/�^xAz��������2n�8�{�����_���vp�N��=k���o�*�$���E]@@����;!�|w������A��!a���@@�j�tY5�8�+`&����B-.]���d�6m���8�a��__._�\iD
4��0`�t�����@;�I����Kf��)m���I�&I��ME�/??_��L����QO�������c����[�jU�v<�{��iW�!�z�v�F@@ ��=�33�[jr4�ul������c/C@p��2�_!�����r]�|�����"S����DW��U�V��eKi���\�pAN�>���<yR���#;v��������
&��w�.��O����d��i�����'K�6mt���Y�f�zO��A���Hnn�x�N�E��l����B8�*�p����6 � ���%���}o4���6m�	���t@@����C���A�����dif�.�<c�Q�����~�+�1\�xQ�����u��q���O��N
�0S�+N�2E/��[��S'o�g�����{N��o/��O�z���c�I4�D��y�$99Y��7j�������KKKu�.�v��_��� � �@�v��K9;}�IKM�i�w��s��-�"a��_sE@��O�E���
�A�Y�W�o��V���k=���W�Z��'���w�}�^rQ=�����N'�~��_TZ��_�~�x�b��m�~��z��W_}%7n��>����3g����B��3{�l�\���uR1�������o@@�!`�_��f�/%��D�����YL\:� � �$[�������I�(V�bif���W�7SO]5l��JCUI��+WZ������1�\������p?���d���RXX�����(={���1&$$x����Z/�qOC999��������r���n=O?|#� � 
[�����0����6@@�Z��������� �,�� t�bj9�'N����K��[�l�K0@�|��-D�zBM�S�u��>���9T�P�U{��7�a8��s�����A@@ ��s
eoQ�O����-��K���/f>?6@@.`�3O�������q4�����@�ksAA����������������+��+�Hqq����K���}��� � �������wo��/f���2@@��,s�����A�Yv�T�-[&k���arr�;O?��#����,YYY�����!3@@��]T"/����+B��m����|H�y$�F@@ ����<u���l"`&��r�Sj���U�D��9s�t���R�k���+V�eo���J��� � �u/`��3-5I�e�Y������/fuxA�@@ vlA�=K1��5t�H� �,;a��v��3fHjj�>\:t��������s�~���������4i��!3@@������"��������M@@�l��C��/�vU� �,W��h��������n^-�x��7Jii��={V�S�6S�8�� � ���S({�J}G,K���� � ��
�c�R�����$3�Um'Z�/]�$�]e���8qB��ej�m���x@���+����V��� � �5������-��{�k�UL�n��x�,�.!�E@�m�.�mqw�gif��
���-Z�$s�bL � �Xl	3���f��
�@@,@�,�
G"#`&��rdZ�\+eeer��IQ�2�|@@@����J����>MKM�i�w��s��-��	37��3 � �@X�;�������0�4����R���X�,Y"������r�-2i�$]�6m��t�M2t��Z� � � �@�l7������/fU�}Q@@�%��BM���%?�Z�����Z�>h7���2n�8)))�f��IBB�$%%���S�y����y�fY�h�$&&m�� � � P�����������g����eJOO���e�|@@p��m�
�@`��_Ct��	���e��XL%����d��2x�`Y�p�>|��0��e���7O?q��<�� � ���S({�J}���!�RYf���<@$�<|#� � p]�T�C�D"-`iNK�-[�L��]+999��e,�I�����*c������H�� � ��P���^1�OT3�0�$�G�o@@�	���P0<]��#� M��a���WU�|�����w�����K�-�?a�f�Y�b���1C:u�T���� � �Q�r
	����/G'�N�:%��o����^�?���}j"7�x��n�Zn��f���[������u�4 � ��@EZj�L���L��� �,��8l����O&O�,iii2|�pY�r��	3�t���s�i*���A[�C@@�:��Jl�����_�K�]�tI'�6m�$�w��w���__._������T���_�R����K�J��� � �@(�b%��H�A�Y�t?�m/??_Tl�>j|*>k����?^�:t����W�� � �8G��0c�����-�r\�l�����"S�������t��AZ�j%-[���������������'E� {����m��ar�}��4sJ � �����p������5���\�"�7o�?�P�;&/^�F�������={F�7�B@@�H	�n%��I�9.a����������[��m���C�T���_���[��;��>}��8�� � ��$`{2�U��C):f�f���[�ZU��s��I��Mk�g#� � �@�l1nvf�tKM�z����-�r\�, # � �@|	�tY|]�X����e'���~��}�k�Nx����cC@p��-a���{�L�3w[��	3�~�z��a��>e@@�pH��+E�h�A�Y�F_Um���D�x�
�z��=e�~�������]f*����=zT�i�#� � �@v��K9;}z`��f�G��<�'�TBl��	r��	Y�p���W�3v��e���7O O>��w?@@���u�����p�W3If�k�f��]�v�,[�L�N���/�h�F��1YFF�^_%�����M��B�@@" `�1�������/�'�


$77W,����G����+�Hqq����K������@@*`"�	A�8A3H3����M-Y�D6o���(S��\�RV�^-O?������������e��)���Z�~8@@"+`�u�s}�m���f��FugcNN�$'W~�G}$���%++K���|g� � �l��J��Q0�4�������5kt����_�F����c���K��F���_����_=q�@@�!0>�P���&;3C��V���Tr��-�r|�l��U����9S:v�X�r� n��z���n���qv � � ``)F�
�j[���rm�����C������<��?/�<��<����'VSI���;��` � ����-�����9>a�k�.�1c�^�c�����C=+�"��;w�w���M���I�&�3f@@�O�YP�U'f�f��d0�N����&Es���S�*			�I�#FH��Me��Y�f[�` � ������y)gg�I�����/�'��^{�5��?��g��e������T��=���w��w��A@@ ���PB�M3H3��9�����%��eKi��yu��<@@���-�����m�WL$��:��]e���8qB��ej2m���x@���+����<c� � � �'`[�BU�n;?(6kE���r�tN' � � w$������+&f��***��e-Z� If�PF@)`�I�m��
Q0�4��������'��O>�#G���s����5J�J |@@@�������R����p�
[�S	���2Q�
��*_`� � � Y��#u'`if��F�S�%%%�������io��z�Y�v�*` � ����mUVT�|l�WL$����K�,�����Y�r�-2i�$]�6m��t�M������@@�)`�~!u)`if�.���{��u�t�RIOO��\��i#�����~��?� � � ��3l�/qo�kc���0+//�q������Y�f��� III2u�T=���|��y�,Z�H+��= � ����2��`if�	V�2�4�	2'\
�� � �@h��E%�R�N��i�I2-�N�}l���/�'�T2,//O ��������	�-[���y��g��3> � ��$�L
�N0�4���qHnn�<���r���;aH�@@���_�_f��_�O�-[�L��]+999��e,�I�����*c��������� � �Z�r
�%)\��p��� �,;a�W�\�7*<xPF���f����>l*�C@@��H��on���0{�����w������Z�?a�f�Y�b���1C:u��5@@�^�,�Is�]�_������e�L`��
z��`�a��`:C@@�������R��33�[j��>6btI�}������%--M�.+W��>a��.�;w���*���A�3 � �Z�d?�X0�df�	c���L�8Q�I�&��uk�W������G��W�9� � ����m�VW�_[���'��T���e��MzVjW�^����������C�J��}u�� � � �l���O���8E������V�P+~<�������&c,))�w�}W�����6u����u�:tHZ�j��������n�������:�i��;������>� � �TE`wQ������)���H�[�	3�v������?�c������Q�F��Cy����g����@@��<]�����6�4������=V�^-3g���;FdH��G�������
�*�Y5��@=����������9~�������A�d�������P�B�����y���o�m@@�*`��yA`E[�	3sJ*yv��9i�����2 � �Z �]u
�@�(`if�	c���od��	r�����a�"2�U�V����wo��g�4l���0?~�=zT�L�"�;w��g�JVV��:uJ���d���A��%$�'R��O>T�������>� � �TU��Y��l�W�%�<S���������];ILL���@@����+��p����7�4�����%����?IAA�^_�W������^Z�v��z�Z-���o�_|Q?A����J��3F����kO��������2b����
z�O�>:��v<�{��_���z|@@���-���������	3���o�!*������O?�7�|S��L�����)=z�<s� � ��B�vG��8A�+.LN����&������B��K6J�m��M���#�����{�L�>]/��b��������j� �v��_��� � �@$l�����������	��k���e���j�z�@5j��Ngdd�_|�]�C%�� � ��S �R�i�I2-�Nw�0k��A�Yv��:$;v�9�|P7n���@	������%K�����^�zyO+..�K,���_��[���!C$R�xp�N�����������@@%0&�h%�W�w��/wt��5����_�O��@i�����2%�y���O?-����E=m�x�b��}jjj��h@@ 6l�O��ggfH������t����e7@J�}���|�ry��|V9|���7N�����a��?��3�v��E8�����W�	���q. � ?o/����|&��}Cy��6>�bu����+�f��${����Z�c���K�.�5�����	�?�����'�� � ��O���w��e�f��,����;�����g��/_�/��?%�6m�$�����e�=f�������_��{LV�Z����v<�{��i7��U=> � ��T��j�������={�\5OKOO��W���6��jY-����
�T��#��O<���%�O%�:w�\�c�3@@g��jg)Fg\F\��]����c�g�}V:u�$�{�����K��C?��d_~��l��U�����Z����>��G%��2��f��~�A�y�*((�7N6L-Z��}������D;�\/�3�p�W�� � � PSfU4c1���O����e��f*Y��gO�r�������#�i��:��L�39�@@�_��Y�c�_�x��'�qB�,''G�l�"��I5.��I���e���_RR�����r��i9u���<yR���+)//��B�S�������	�0;s��<��s��}{�>}�~�jp����k�.�7o�~�u��*��vT�/^���Ri��MX���8II�?@@�	�n"%&&&:�Q5T,��8�f���V/~V�Y���Ua? � �@�
J���D�^�8��'a�IR�)��j'N��
6��m����c!�����d}������[����>�~��|���������M���C���fH��3g�C�=�{�Tbn�������#G�<��`;�g��������u"1�����S�o@@j"`K��;���4�����g�1�0�L�o@@�	���@`���B�����������KQQ�>|X������sR�~}i����j�JRRR��?��7���U�l����������K}���B���/��������D���Z�Q%�B�t�^���v��w���L�2E�;�������o@@�*���D^��Y�4��J$>;l�WL$��R�|��9rDi>���1j��������>@@�������f��������5�����Kz	���[{�f4%B��
U/�q����a���&�w8������i�
@@�������9>aVRR"YYYz��`ST/sn��]�*C@�[@��FP'�E�0�4��"�� � �5������Am���f�����K�Jzz�<����%�j�|�O�-�w�� � �,�����2�4��B
�� � ���S({�J}�$a��a���_�O��d�J������@p����9�@@���B\L���r\L�I � � �@T<�/����W�����d�d���0+((���\y��g����47�#� ��@�d�.���hif�eL@@�(���C���b�Y���|����1v�S�9>av���:u�<xPF���f�]���D�n�!� ���K1���d*Z�����x������������9m|�@@7	K�yH�y$������	35�
6��E������%��p@�y�.��K�,f�f�-U�dWqq�,Y�D��������[d��I�<m�4����d���u26:E@@���n&������n������&`���0S�����l����n�Z����sA��F�]�;KJJ��w����y���|��m��]�*�����k��z��#Gd���r��!i���~"�W�^��a � �@�$����fl	�A�Yv�,���e��q���f����)))I������/�7o�77����c@@p�@8O�yH�y$*��/�'��y�y�����G����e��z��ko�SO�-_�\��?�[s��
ybNN���o�����}����_��_�]�3f����
III�����
2
$�9�
@@���=����7��3� �,;a�*���'��������������-[d��y��3��@@��@Uf�?�v���;�?���9>a�r�JY�z���9S:v��U�V����wo��g�4l�P�M�}����g���������2e������={V������S:�TO��A@��O�U����%`if�	�X�l��]�V�?��e,�I��8^}�U;v�ddd8a��@@����-�=K��'�1�d���0���od��	r�����a���
GT`��Q#������_�O��4a�ec���{��G������/o����1B�����O@@ ����F��m3H3�N��Z�C��1�|i��E����5kd���V�������@@�{��E%�R����I�,0�-�r|�L-���?�I


�o�����f�a��������AvV5a�{�n��z��Z����m�6�3g�~������O�>�e= |#� �U`)�*`Q5&<��Z6�,;a2������'�L��V��,��nB��t��<k�����@@�{��$�x�A���-�r|���?���z+�������UI�������g0*(���~%���o���?�%K����?/�z��T���bQK5���_���O@@ �O���h|�A�Yv�����e��Mz8j|*�����nlT��C����A@@�Z�=f<]�R��/�'�:$;v�>�kG|�A����W�*	3�������r�?���������������/�^xAz����E��9n�8����<��3��5)����&�s. � �x�������e����{s��{�J���@mt��5���A�Y�xG�l���+�y�fQ72;vL.^����:�C=$={��f��� � ��t���=�e�������	�������$��{T��zo���������E�}��*Q�1�|<(/���<��c��gM�I��D�s@@ ����������@m�1af�����s��i���n� � � P��s
eoQ�O�?k�(�~{���-5���b"a���PK|T�s��eQ�6lV5I���R�;w����G��Y�d����}e������R���
�������o@@�����m��ep��g���0�4���a� � � P��}�������/�=a����J�N��w��r�]wIrr�L�J�}����u�V������g|�����	�0SO����J�6m�m���Z������o�s�='�������K�z��9*��k�.�7o^Xs�v�N@@�%�^^L��%?N����S(N�<)�|��9rD?]f��Q��ul0�C@@ 
�n2%aV5l[����YNN�l��E�LZ
X-{��cGi�������$?����>}ZN�:%*��������\k�z#F��N�@�o��V���k}x���r��%y��'���w��]bd����}�v�8q�~b���_��-�����=+�������A���es���m���w�����qm��Q'�F�h8�G@�.`[VB�~"�*`if�	�-))���,}a����6��k�
�@@@ B����L��k���0S�:q��l��A'����C}��\*A��OQ	�������$���+�u����K�.��J�}���2e���S	��d����ys�������'�***d���RXX�_�����_���c5&�`�� ��H �]r������jif�	�����K�Jzz�<���z�
�J��Z�hanRF@@�(
��c����[V5t[�����9����^�������RVV���_��4i�DZ�j%)))������T�yn$�j�E�}h��6u���_%�l���z�u����m��� � �O�@K1��&���;aB n� �,;a�*Y��f<A����@@	t�)��T��a���0��49@@ �lKJ�9p�\�]I�[U3H3�Um'�

$77W�{�����htA� � � P[��UY�hT��_$� � � ��/`��_��?�k�Z�c��A�Y���T����+2u�T9x���w2��m�=@@���1F���0��/-!� �TQ���F��0�4�����wK/Z�(�pX�1(@@��@��MY��z�����Y�,9@@ ���g)���DL�A�Yv�����/'N�CQ�p���Y�y��������!3@@�V�?�2K�/�-�"aV}O�D@�����}����ScN����&��;��{��'�>���W�^='�1 � � �J�c��e��_1�0+++��'O����1�?ZC@j[�����I������0�4�\�c���r�JY�z���9S:v����7 � ���@�N���!l�WL$����e��%��Q�[n�E&M�����M��n�I���� � �������e��v�0�f�f�#�K�Z���od��	r�����a���g!� � �@�l14�s�Xm���f���2n�8)))�f��IBB�$%%���S�F~~�l��Y��:11�fB�� � u���S���:=8P�����z��y����$��o_IKK��{��:N�d' � ��X��B����-��a��=W�f�������>��*�ZV����<0`�<X.\(��&��l�"����O��'�� � �8[����1KI8��1��x��kq�8-�������
9q�lID@@�-��Sb�j���X����f��-��k�JNN�~w��|f����������c%##�3/�@@(�}�.s��bH�"`iNK�:tHv����������G@@���-�&�����f,�9��%�{�=y��wd�����E�O��Y�FV�X!3f��N�:y��7 � ���l��!����.T��\.��>z�����]n��F9�C�<A�Z��,�:�� � � ��*-��5������	�}�������z�����+Wz�0SO���;W��DZ�
j�D � �Q����:b��pG����������vOV4����{�h����3/G�
f�f����@@@ .�xJ]��m���0S�����M�6i5	u��Z�C��Z}��_B�7� � ������]q��T�y�e*QVp�I�:���Tnm^,$�|X��0�4�v����.H�~�����r����=t��]B�� � �TM�KGW�0Pm[�	�+W�����E�t���cr��Ei���t��Az�!���g�9�@@��l��!�G~_�*t������f��C_�Y}~�\?q���d���+kv�� �,�kGwoEE�<����������E�S:�g�����c�P�8� � �U���B,]E��m�WL$�������s��i���n� � ��������`~��<]����;���H��B���$�-3H3����:o�����WO�y�)..���������+���H@@[ �����a�h��b.at�D@p�@�?��#�Q�)�`�M����~vSKI��4h{�0�4��[�n�������C���.����4����NT�[o������8{���_�^���[��v��2�����i����_-������"��{��l�R�u��X��<�=���u��� � ����ibi��o����H����?��9r��~��F0j�(INN�b � �@���W���:����r�����)r�LB�w��7��0�=f�*��� �,�>3�5>��SQK3Zrq����r�����;B�������r��q��J~���W'��M�����U2j����m�6IJJ���J��n�Z^y���x�V��C����?�P���4����e@@%`[��x:r�
[����YII�deeIiiiP�Y�f���� � ��3l����|�3�O�Qx�,��gIFf���� �,��B�j�J�����]�V����B��H������d�����O�f����INN�8P
�����CT��_�J233���>�H���?�s��_�%�qO�������_������]�@@t*�t�~����	�u�����Ku����K�6m����,-Z����6 � �@	��������T��w�+�G}�:Y�8�Sf���Tnm^,7��O}�a#������gF���������r��	����E%~5j���z����cz��9st��S�����-��_�?��?�����-�����_]y�y��'�������/]�$���!��5��3;��?�1�v<�����
��Z2��@����6 � ���l15�td����	3�,SI3� �����@@�h	���W}��}��#��Z��b���Wv������q���������J�f�f��o!�5'L� EEE!n���N`���"�Q�:�$�Z�q���r��7�xP%��%��+g�����\���r�j��}g���'��'T;�z>��q^�����6 � ��[��#����_�O����g�����?�"�� � q������8uD�_�������{w�'�T�L}X���R�����*7�������B>��3��2��j���O			�V��W����Hm���|�M�p��������[��[7o�I25��5��n�T�%����W�gk���~"-�zf�����e@@���nB%�����9>a�^�<u�T9x���92���������5@@��v��z�|T,W���-���Y����N*���.YPi	Fs0fBL%���r�.i��?t�~���7����������3���'��2D'�"�C���������k��rU��Z���G�eu������n�?��>
����j�`�����i��!�����$����[�����W�:u@@b@�����~{��H�uo.�wo���M]�v��tm���fJa��
�h��� ,���� � �@T�^wH�Y_�m�eAy��`�����u�,$��w�� �,G�G��\XX(���������&�{�����������o~�=	��3���o��3�3f�J�%''=��[o����������F��������l�0��u@@�����>i5jf$���W'N��&M�H���+-��K���
H� � ���@8�25�����Q�^B%��%u��nK��,��(`&��r�����&M�o��F�����UT���������O������?^�=*.��
�}�?�����R�IIIA�{�}�O=o�����v�F@p��1�����_���i��W@IDAT�������33 ��$A"���C|`r�lQ7�b����
���C|��,*���>X�����A�U�b�U�D�5(8(����������}o�9NWwWWU�2N����V_b�Z�j���6r�6��SO����y����WTT��P � ��U����Y���>n�T���?�~@j^��-JC���-(�q����&�yNdN������%]{����g���������w��}����.���R�;w�sS����E��������Nr�}��G2u�T:t������1c|���m��]��k'm��m�T���@@���SK7����1#�W��a�b�����m��*��g�yF�{�9�1c�s�1�+P � �@�o���V�*���/��d��x��o^�����nR�}��N���0if:���Wi�>��,]�T�;�89������DV�X!�����~��r�%�87�y�f�2e����9_�T����^jkk���o�eK�������Q�2i�$���r�Z3S����"� � `�b*���/l�����7o��2dH��(;<�� � �W ��Y�o��m����h�eI`��Q�*��m:���43�N��~m]]������������������*_|�3sL���U���Gqf��c]�v���GK�~��,^����u�V�8q�8P&L��/�U���c
e@@ �k6~)S������=��1�/`�>`�g�����g������x5xi��u�a�@@� 0e�*Y�a���G����Y>�V2�cP3X�9H3��"o�R��;v83����n�����S����}�����:�/J�o����L6�/~��:������@@ �,�������B0{����'���b�FO"2 � �P�z���?��Ov�*���/��db	���f�Ps�f�3Z	�!� � �@�X�1�]d�>`�i�&Y�r����g��f��� � ����o^��|��P��3�,���*����l�G� � � >�c�}���_�����@@l�����6�[�Q_��2-���
�����Zai�-R\>�5's#`��tnj�@@�����#�W��a<� [�c0��6�"� �"���n^�G��K(��Y�1���.����l�K� � � �c�}_��_�����+�o�>6l�|�����zj
8PZ�n��� � ��l�r3����\����~�4�l�G52�'�N���fV�Y`	�,�f�hs�f��\-�#� � �@H}I��d��l�������d������y�d��e2�|O�3f�1���� � ������]�`dyo�\�`VY.��W�9H3�����@@@ L�/�2��~��_��)�GyD������/���jy���=u�<�Li���g>2 � ����X��b����x����	��U=���Q�os�f�}@F@@�k�c��l��P�L���Z��i����_����SN��O>��o��-�q@@ 9�`�Z���q'&W(�3.��'
�@s�f�k#� � �@�X�1g�-*���B0{���D-��h����{N�y����[(p@@��K0� 	�?��J��������vH����C9H!� � �@�
����r���m��+�f��f�-���';��rCI- � �@���@7�N�*S����Lz��Q�P�5a�%=�����$<���	��43���"@@���1fZ�y��WhfO>��l��Y>��s���O������������m�|��g��Y�fI�.]b��� � �.0y�*yocM�L|�-!MNO�%��Ls��`�+OhO��43��0@@���1f�1�Bl����n����q�������9��#��{�g^2 � �4	�c��X�1�*����43�JY\� � ��#`[�������l��Vk��=h6�������1��,9I���J]]����k�3�&M�$G}tL��[���;JQQQ�qv@@��(7s���\Z�*��� K0�9�Y��e�K�=���S	 � �����*/�{S�c1]khf������_|QF����q� � ��	0�,9� s�Y����).d3�;� -,_^��mQ � �x��_���!Wq:S�XL��%u�"� ��/�hmt����HU�
���,�XV�x-��l�A��Uf��Qe"� � �@8l+��cn��6�"`��>�6@@ ��?��F���\�%���fs�f��lu#� � �@vX�1��~J������# � ��,��?��gVYI��,��?]���� �L�.�� � � �w,�|���_���Z� �d\�Ye'�Z�{������	�/*�&,���'�O��43��7�
 � � ��U�6^g9F+UV��_��JN� � �@�l|���qS#�4K0g���A��Si � �dN��3g�NI���tD�@�K0��3<���@:m��t��U@@"%�r���n����Y8��V � �i	��U6rx�T��H�.N_�k	��gHI�k����B�����h<�D@@��lcwV�I�0c�m�/f�� @@ �rfKT�l��^�!��%��Ls����).�����%`��ta�%w� � �J�6~g��{�6�"`L_P+ � ���Z�q��w\��oW���d	��Q�]E� �L����`@@�`9FO��e��������@@��	��q0KWK:��)LZ�*����4�lM��`LH�� �LG���I@@"$`��cpo����Yp�A� � �@�jV����������2�,!MNOx�*S�a	��vI(+3if:���Q � � ���1�L��m�/fY��P@@ �~�`�>n�T���|�����W������U=�T�d.Ls�f��n�+@@�+`[��1|p�����Yp�A� � ��o��
��,�`j�f	�����fs�f���~h7 � ��H4�_x����9�����YN��@HM�%Ss�*?��J������A4�:C*`��tH�K�@@@ [���)���Kl�/f�(@@ ��?�����2�6K0�*����%
5[V^��)�yM������9H3����@@(<�c_���_���O�@���M�`45�K�cp��R�9H3��r� � �QH�eX�c��a0�O�@�`	��P��c��'og��t��
E@@W[���]�rr�6�"`�z*A@�lD�W���\�Ye��b�� �L��rO � �DQ��������p��B@��	x-�������P���5�\SZu��v��I��9H3��<[@@�_�D_�e9����6�"`|��@���Z�q��w\�� U�+\�p2�,��}�(�`��t-�g@@
M�0����e����Y8��V � �@l8�,�hj�f	����P�9H3�Q�w�{��g��������������#��3�<S�>���K��W_���_v��%�z��!C�H�N��<^�uA~���z�u��y][@@��`9����m�E�,��E�@@�@��������5	��o�%���	f���;����43I���n�*��z����J�n��Xuu��}��r�g89U0j������oJyy�TTT��M��s��r��7:�n��8�_��|F�|]������r�G@(�D_�e9�p��m�E�,}C+@@ "��`6o�%M���~�eeU��@j.s�f���������^>��S������NpJR2�_������+W���3e���2n�8���%K������:K�;�8��^x��rt>'��#S�E�D@( ���/����m�/f��Z� �.`�c��e����N4�@�%@�p�� �LG�D��_/��r����?��.�(!��Y���ej��K'����e���r���;�3��g���{���5^��|�
��N-����r�G@(�cw?��_���g�@
@�%����*+�>V����M��P��43�Fg�q�?��<���r��7����-��c���b�bj�8q��l��s��W�1n��Q�v��z~����u^��|1�4�x]����r�G@��D_�e9����m�E�,<�CK@@����oU�*SK2T�����$�w�(i������gHI�k�����43�JY�r��?,��-s�]��e�_j��+��B�3��L��;��3tg��iT�%�Yfn����'m��uf���SI�N���/�}@@���}�r���W����Y���� � �@	L��J��X����c9!MNO�%��Ls��`�+'�0if:�"�����g�o�!������GN�K-m��h�
�	&8�8z�h9�����;����{��G�z�-)--u=��CI������/�������_n2�6lH&;y@@r$p������6���x��z�m}��I"�����3v�B@|��o��3�c�]���43e����_^{�5Q�;��#����������53L��l���"�����}������w�}W***\�?���y�rt>]��z]��~]^*[f��q
 � �]���%K��mQ��.����	0��D.@@ T�e����\Z�*��� K0�����$3�QR�[�h��t�Mb�o����y�<��c����E-���pl��M3�UW]%���s�{�v^?�������o^������/Q����� � �_��Xa&|}h1�,|�D�@@ �U�?��g	���[��|p��-�[s�f����2�p5�L�2;��s����S������+�tf��d��kj��I'������d���2t�PgIFxKt~��1��Q��o�.���s�y�\���uyl@@
C������f-H?`�Z�v�A�e���wv�������/e�����uk��/��v�l�"+V��M�69�V^zp�
�:��2!� �@^	�`��9���y��R���5'�/�g	������j@�=Hk�I��bno�={��o~���w�\p���S'y���1�e�]&?�����l�)S�HYY��w�yN��������Z��������WVV�*G���I��������g�~][@@����.Sw���!�sv�XL�Z$f*���/��O>)j0v�G�z���K����XWW'�z��m����]��A����+^����< ���@�?������v����S{bs�F�����u�����?��9��Cg��f���U�V�#�<��S��v�*�G��~��9��������e'N�(�	&�K|��VnLA� � ����3f���k���na$�d|��gE�w�)���&�Z��O�L�Q���J�=�N�o,��9�`���ns��Ot�����l@@ �X�1:������cY�1���Z�i�K}f��n2��Q&555���3tw��)�����}{km^��E����nj&[�W��t~��:[@@��`9���?��+�3�-���R9�����k�uf�y���
���:��(�^zis��������_.���n�O;����H � �@�
��1f�
��L�`�{������	QT�MX�1!'r `��t��
@@��@��
X�1��,�6��d��4�0{��7e��Y��l.�f����;��A�~�i��j�F^ � �����w��e��;Q��
H�%�����A��N� .@@@�@l��c�]�Z�m�E�����K����>*�����N:����Z�R��������5jT�u$@@ �X�1��%���h��,�h�a� � ��!�r���O����3��?��O���O:y4h�6���7���_/��
TS�Nt���/n�.���
���k@@ I���%K��Mx��]�����w���y8��._N��};VV{��R��W	�s7�>}���N��9H3�)�E � � ��mv�j�1��*���������+��C9�/S�)��?�P�M�&���<�����U�L��eB�2@�'p�.}�/af(������DnJ���.;t���Nc��d�kN"�&@��M�s � �DW�0c9�p��Y���3�V�\)3g���.����������s��K.���w=���7_G@�-��`�:��F���+O�N�c���(� �Lg�*�D@@�,	�c�`�X�m��3�f���RSS#]�t���w�e�]&]�v�;��C������1c��^�Z~����o~���Y�^�F@�L	��f���w��`YUo��n��:}��u����4�lMXuI�3���5	�s���A���]�� � ��}��r�����eE2`����������s�='����_�������*���w�w�}�����r�M7�Z~e��Y���o:�(S�)[�n�,[�LN9����+=����� ����������5	/`VYB�����U�SZu���i���d�A��N��#� � �@0������jE2`��d�<���n���r���:����#o����z����W/�������_V�Z%j�YYY�|���w�cl����yk�D@�P�Y�q���*Aoy����IY��!h)M@�[���i�+�� � �a`9�0�B�m���"0K�n��}��M��K���;vH�����f43x�7��F@��l�3[��`�}���!��c�T�us�f��^1 � � �@���|�r�i�f�����Y���@�*��a�����3����X�`lI�����43�f�<@@h�}�����a0����� � �a��fjV��#��w�y�t{������	k.�q����&�yN fs�f���f�� � �M,��������Y��'-G@�&�Y%�m�Ix5K0&���	�`�)7�$`��t@��Z@@�)�r�>�B��6�"`���Y � �y�`��i�Jd	�l�Rn��A��[;i � ��
�V�a9�X�0���_���c�
@2&`�C�,�%M����*����0if:��P+ � ��`9F�R��g0g_�*@�����|,��i��euk���PZu��v��I�I����|��� � �@�X�1�{�6�"`���� � �@�D����� U�+�C�`	���2� �L��q4@@��Ul�2��*�m�/f��.� ����x5�c	FS#�4K0gO��0if:�� � � ``9F�J~�����W�Z@�P��,���6�$���������r�Me!0if:���Y � �D^ �j6�y�|�����S�V@p��U�.f	FW����,+�z<g��"�0if:��P/ � ���>{P���>�D�9*����Y���� � �@��?X������.����Ss8�A��gki � ��ca�l�/f���� ��`	���z?��J�������sS��4�A��N�X.G@@�,	�f,��%�,k0�"8E#� �@v��*S�+��]���P����5���fk�|%=�����$<�	
U����B�_�@@�|P_�}���dY��V�9d�O)����Y>� mE@�f��sV�{k���*P6bX�����X@-�X�f�k��\y8Y�� �L�ms{ � ������������60�����m�E�,��F@@�`	FS#�i�`w���p��43���
@@p�i��*e�p���=�ak0����� � ��c8�j�z�oyP�;� 
u����m,����U!0if:d��9 � �DV���2��qx$�
&��l�/f!�,�� ���Uv�"l)s�um�s��&�V�"���=���(��43�{�@@����e��i�����_����h% ��P���s�q��(u���I,S����.���o���o���cE�������c� us�f�����#� � �df��].�]~b�M|��_�|��@�`	�`�����kCe}m���9�z�iG>�<����R��k>"es�f��l��#� � �)sW��k|5����b
M&������k�-�����{�`�a3i@@ �,��U��^��Z9��U�_[���wN�yRr�)�����8�@T� �q\&����.��@@�*�g��vfM��9WG��Zs,������`� ��`VY(�!a#���[����N����ev�Q�X�1�('�.`��E����#� ��M`�S������W��=�W�B�����1�9�a[y���>aIF-�@�,�}�������d�~����9�uf����`�@=HS+|��9� � �9����l��Dt���Y�~�y��f��3�]M@`	�@�=+M(3/��aF���"�@Ks�f�[�� � ��R�������`Y�����V�"���}�f3�m�E�,���� ��^�����\F�����e�!}$k��B��n^v��T�7�t~�
�v?xIb�q� �L�ec@@r(���
�EU�k?��?�tZ���f���R-����
@�G`��U�����
V���z%<���	��M_[���rp�Y_[)�w�9��]��t��i�����@H� ��I�h � ��H,�3��o��W�,L�9����a���9� �YPK0N���k��!�������Z���t�E�����Xg	�G�>g>��`�Va�@bs�f�_�3��n��d��2p�@4hP@CC�����������]��W�^2d���������.�o>�_o���:��a� � �DK1.�{Hx���]3J��}�,�v�k���������k�q@RPA1�zjI��?��I��K}S�����%�D����&3�d�����z���
��W�$w�B,`��t���������|��s���2f����U0j������oJyy�TTT��M��s��r��7:�n��8��,�rt����Ix]������r�G@N �g|���O��uk���3f���E@�X�xSc`�:�;������ ���@fh!`��t��<�g��0a�~���u�V���\�Rf��)��q��9BK�,��{L�:�,9���\�_xa��_��������:�����> � �@0�Rw��p���*������_�0K�=�� � `H5X6}����VXJ�P:���;� EOf�X:�\��s�f�}\Z�Y�qY�t�L�4In����Y�f9����K�.�����e���N�M-��f�%:���:�x�����{]����r�G@F��Ss��������s������E� � ��D`&*\-�x����x������R���@ Is�f��,���o���	�
6L�9����K[�&N�(�w���s����Z�q�����kW�����w��*G����q��:�����> � �@��]���z~YC�Vka~��m�E����D@�T��]���('�&� ���r�l��43��z�^�����,�x�=�H}}�5`�f�u��Qf��s;j�F�\�Z�����y��m����4?�b*i��T���&��a��d��@@�"����d���-��A~<���9pH������W+�&U���������
R���h]��������������_�|��@��l�Rf=�^>�r�[��s�]���k5��H] �@��R7�J2%`��t����r�x�
�={��������GR[[k
��=Z�:�(���;cnS��z�-)--u=��CI������/�������_n2����"/ � �R`�'u����D���]��z������e�v�I�`����8��f��q- ��D@�,�v�zoc��j%r����G����{��L���T!E���@YN ��$3��5(���l�	&H�v�d��iNkv��%�\s����?��.�H;�0��W\!EEEr�}��������w�}W***\�?���y�rt��Jw���[|��#� ��N`��U�>���"N�H����gJ��E-N��8CJz^��x���_�0[/�@B.��d�?(�i:�d��M%PF���)�&`��t������g�����u��w���t��M&O��,����K�6m������d��}R^^�z^?�������o^���?�\�@@r#�s�Rw��`��s����Y��	�D@ 2�
��`�b��vH%H��#P���)B"`��tH���f�@�_��������+����������o_g����������/�-�����I'������>��S����C�%���3�����o�����7��3���N-���~][@@��	��9��y�Z!�WXYb~����}���/�d������w
D@ ��G�R�-�;���zf��TZK%PF���#G��9H3�ans����f�7o�)S�HYY��w�yN�����w�yv��7;�-�����tn���/�&M���*g����:�\���uyl@@r'�h)���HU���5$k*��e��6�"`��oH�� �@62$������d����yji�����,�U��a�
9P�Zj�6�J(��<�0if:��
�U�fc���a����_��U�V�#�<��S'�v�*�G��~��9�������m��U&N�(t����Q[����7�"� � ��bL��P�eJ�6�"`��{�+@�{�����Tg�� ��Xor��~T##~R��m����l������
��43Y�7��jl���3k��e���R\\,���������D������d�����������|l@@�+�(X�>��}�����J/��u7��_��[@""�� YD3v��2FIA���9H3�yy34@@���}�2kY���s��f�������t9; �����3�L����!��{� Y����@���43](��} � � �D���g#��
K3C��B�)d����Y(�~4
@����o���eZ���s�f�����@@�#@�,��(����m�E�L��E@�2$S��R/f�9i�H5HV��)�x��N�
��0if:�-�e � ���@���=$�n&�-.��e��6�"`��7�!� �I S���%��	�e�W��J5PV�},A��w%"Zs�f�C�`� � �@�	0�,�+�`�������^�J@�d��Yd���T�d�%=�`6Yf����s�f���h( � �! X�^��R�Z�6�"`�u�"� b���jb�3��e��S	��wf�e�/(�|0if:����"� � F�bL�W
9�D��xw��f�:~�a /����Y��	���>���p���M6�$ x �!6�C�qNDB�1�B�������x��M|D�L�j��].��Y~�4�RJ0[u�)a�_a��8,#"w�O=���v�i�|t�W_}����6�,�����2^ex!Vr��A����'���E@��@<H����.XD�,��4cl������b_�p��k�����6Y��k�m��E@��@��K�Y���j$MO�B�x8F`���o�/���f�a���;f��'�4	Q��|_}�����Y
�%�����1c����~������� �!F�y���o>�8?!�8?%�c�E!+l������^���-����0~�x���&���n$������;�u�]������k��f���]�y���t6�no���y�����.�U=
���y�l�����]���N�����{��M7���	y���GAt��"
Wr�����'��41�-"�N�AZ��Nt�" ��X�s3����w4����(���b�����h#Y�/	fm�t�"�,*���ay����� �����������������q���w�m�4m�Q`�B�����%�!�����{)�4�g�Ut�
.l��bd�F���4C7�������?�D6��
60D6u���\���������5���X���T�t?L�����9�x���������{��3g���o�g�}��a�`v���t������G�6��|���^����Y{{�(N?~<���+9r���������M���S���-[���wS<�*���c���37o�L�?Nq��c���x~Y<4:�gq���[�����B��qm�x�"=|�0��q#���;M���.1��`����8�fr���_�?��������<�,�y�w�|���zw�z����g����x�_<s/�~��������#v��?�}+��9Y�����x������E����|����rr1�X����>?�������K�(�o��&�C������:�A`�F�������XbwU���Gb7M���r�J����~���}�vz��Q���c|+m��9�8q"�={��~���g;>|�y?���?E������q��t����Vw�c->�F1�����gb[�N���[_�E���b{��h���<}��7����GA���W�m�ql�������3�}>y���_��!�;��V��1��������R��(
z���k�o��3�K��-^�m4���?�S;������6jw�4���-^�~�)������W?�.��.\hb�������+�����9Y�����xj��4n��'>[Q�� �'i�y
c7F�����"r�b�?�r�e�����,�@W�����#[�:��1��A��Y
D��_���E�8���!���n��=�����m��5-�.�(�>��(����;w6��5���o�h����xX<�k��=i��
�m�v�_�q��w���A���8V�;��Ts��/Q��b�Y���f�l�(��um|>
q�_�nv�u����"v�E_Q(\}����~�p
�(h��8���!�F�������H�o���a������y�����;)�����Y�Z��~��I�����~f\�L���lE� 0d�����M�\l�+el�K@.&k#Z.�9������
V$� �y.��W����J����B����>���z������W��	��<IS0��5M���M�no���|��;��n�����
��X��[2�^	 P�@�����^�����@���w{=LQ�Z�M�b�t~>u�>H�E�������b�lp}�S��
��X(���K�l�B�d	�W��}d��q��}�L���z�F��;��n�ZX'@`.�$-?����@���M�no��G�O}�7b��"@�����/����� @��y���w_�� @� ��@W��`���6 @� 0�@����S4�# @� 0�@W��`6!�� @� ��@����}��= @��U�+�R0��� @��
�IZ~^t:#@� @���	t�_
f�K� @��p�$-?��� @��'��)���:� @���IZ~^��
� @�U
t�_
fU.�A @� 0y�������� @�C������Z�E� @����IZ~>�7A @�|G���K��;.��	 @�Xl�<I��[��	 @� 0[���K�l��Z'@� @���y�����o @� @�����/�u�j� @��	�IZ~>]k>E� @��t�_
f���� @�3����|]i� @��t�_
f�� @��I O����4� @� @`!��/��Xz�$@� @`�y���q��D� @��y�����eu�� @����$-?�n"L� @�����/���P	 @��/�<I���k�fC� @��a	t�_
f�Z#�!@� @`��$-?_ S%@� @�@q���K���2�� @��y����!@� @���	t�_
f���2 @�F
�IZ~>�C�$@� @��u	t�_k���� @� @���KKK�M�&��	 @� @`]����7�k���K�.��{%@� @��
\�z5]�x1��f��i @��hs������7������fh�����8(�\A���E*0DqP��.�A�T`���r]��
���A�
�,R�!���t!*X�C�+�BT�H��V(���E�kt�=iq��`��Snl�����{C^�rc�����8����8(g=�����W����A9�!�$��:��&�Y�'q0��)7���@�����G����.4U��8�`�
Q@��qP�"�8(�\A���E*0DqP��.�A�T`���r]��
���A�
�,R�!����V�Y#0IEND�B`�
0002-SVE-popcount-support.patch (text/plain; charset=UTF-8)
From 0b51becb0505d5bde5f8e2acc90a7f4f4b604fe3 Mon Sep 17 00:00:00 2001
From: Rama Malladi <rvmallad@amazon.com>
Date: Wed, 27 Nov 2024 07:15:23 -0600
Subject: [PATCH] SVE popcount support

---
 config/c-compiler.m4           |  33 +++++++++++
 configure                      | 104 +++++++++++++++++++++++++++++++++
 configure.ac                   |  15 +++++
 meson.build                    |  37 ++++++++++++
 src/Makefile.global.in         |   2 +
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |   9 +++
 src/port/Makefile              |   6 ++
 src/port/meson.build           |   1 +
 src/port/pg_bitutils.c         |  18 ++++++
 src/port/pg_popcount_sve.c     | 103 ++++++++++++++++++++++++++++++++
 11 files changed, 331 insertions(+)
 create mode 100644 src/port/pg_popcount_sve.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index a129edb88e..eee9720931 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -754,3 +754,36 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_SVE_POPCNT_INTRINSICS
+# ----------------------------
+# Check if the compiler supports the SVE popcount instructions.
+#
+# An optional compiler flag can be passed as argument (e.g.
+# -march=armv8-a+sve). If the intrinsics are supported, sets
+# pgac_sve_popcnt_intrinsics, and CFLAGS_POPCNT.
+AC_DEFUN([PGAC_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sve_popcnt_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for svcnt_u8_z with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>],
+  [svuint8_t accum = svdup_u8(0);
+   svuint8_t buf = svdup_u8(0);
+   svbool_t pgTrue = svptrue_b8();
+   uint64_t popcnt = 0;
+
+   accum = svcnt_u8_z(pgTrue, buf);
+   popcnt = svaddv_u8(pgTrue, accum);
+
+   /* return computed value, to prevent the above being optimized away */
+   return popcnt == 0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_POPCNT="$1"
+  pgac_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 199d666aa7..ab6092d3c5 100755
--- a/configure
+++ b/configure
@@ -646,7 +646,9 @@ MSGMERGE
 MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
+PG_POPCNT_OBJS
 CFLAGS_CRC
+CFLAGS_POPCNT
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17653,6 +17655,108 @@ fi
 
 
 
+# Check for SVE popcount intrinsics
+#
+# First check if svcnt_u8_z intrinsics can be used with the default compiler
+# flags. If not, check if adding -march=armv8-a+sve flag helps.
+# CFLAGS_POPCNT is set if the extra flag is required.
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_u8_z with CFLAGS=" >&5
+$as_echo_n "checking for svcnt_u8_z with CFLAGS=... " >&6; }
+if ${pgac_cv_sve_popcnt_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+int main(void)
+{
+    svuint8_t accum = svdup_u8(0);
+    svuint8_t buf = svdup_u8(0);
+    svbool_t pgTrue = svptrue_b8();
+    uint64_t popcnt = 0;
+
+    accum = svcnt_u8_z(pgTrue, buf);
+    popcnt = svaddv_u8(pgTrue, accum);
+
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_sve_popcnt_intrinsics_=yes
+else
+  pgac_cv_sve_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sve_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_sve_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_sve_popcnt_intrinsics_" = x"yes"; then
+  CFLAGS_POPCNT=""
+  pgac_sve_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_u8_z with CFLAGS=-march=armv8-a+sve" >&5
+$as_echo_n "checking for svcnt_u8_z with CFLAGS=-march=armv8-a+sve... " >&6; }
+if ${pgac_cv_sve_popcnt_intrinsics__march_armv8_apsve+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -march=armv8-a+sve"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+int main(void)
+{
+    svuint8_t accum = svdup_u8(0);
+    svuint8_t buf = svdup_u8(0);
+    svbool_t pgTrue = svptrue_b8();
+    uint64_t popcnt = 0;
+
+    accum = svcnt_u8_z(pgTrue, buf);
+    popcnt = svaddv_u8(pgTrue, accum);
+
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_sve_popcnt_intrinsics__march_armv8_apsve=yes
+else
+  pgac_cv_sve_popcnt_intrinsics__march_armv8_apsve=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sve_popcnt_intrinsics__march_armv8_apsve" >&5
+$as_echo "$pgac_cv_sve_popcnt_intrinsics__march_armv8_apsve" >&6; }
+if test x"$pgac_cv_sve_popcnt_intrinsics__march_armv8_apsve" = x"yes"; then
+  CFLAGS_POPCNT="-march=armv8-a+sve"
+  pgac_sve_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+fi
+fi
+
+
+
+
 # Select CRC-32C implementation.
 #
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
diff --git a/configure.ac b/configure.ac
index 4f56bb5062..3c0b8ffdbe 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2107,6 +2107,21 @@ PGAC_LOONGARCH_CRC32C_INTRINSICS()
 
 AC_SUBST(CFLAGS_CRC)
 
+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)
+
 # Select CRC-32C implementation.
 #
 # If we are targeting a processor that has Intel SSE 4.2 instructions, we can
diff --git a/meson.build b/meson.build
index 83e61d0f4a..7c927883af 100644
--- a/meson.build
+++ b/meson.build
@@ -2205,6 +2205,43 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of SVE popcount intrinsics.
+###############################################################
+cflags_popcnt = []
+if host_cpu == 'arm' or host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+int main(void)
+{
+    svuint8_t accum = svdup_u8(0);
+    svuint8_t buf = svdup_u8(0);
+    svbool_t pgTrue = svptrue_b8();
+    uint64_t popcnt = 0;
+
+    accum = svcnt_u8_z(pgTrue, buf);
+    popcnt = svaddv_u8(pgTrue, accum);
+
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+'''
+
+  if cc.links(prog, name: 'SVE popcount without -march=armv8-a+sve',
+      args: test_c_args)
+    # Use ARM POPCNT Extension, with runtime check
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  elif cc.links(prog, name: 'SVE popcount with -march=armv8-a+sve',
+      args: test_c_args + ['-march=armv8-a+sve'])
+    # Use ARM POPCNT Extension, with runtime check
+    cflags_popcnt += ['-march=armv8-a+sve']
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 0f38d712d1..523072a0db 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
 CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
 CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
 CFLAGS_CRC = @CFLAGS_CRC@
+CFLAGS_POPCNT = @CFLAGS_POPCNT@
 PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
 PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
 CXXFLAGS = @CXXFLAGS@
@@ -769,6 +770,7 @@ LIBOBJS = @LIBOBJS@
 
 # files needed for the chosen CRC-32C implementation
 PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
+PG_POPCNT_OBJS = @PG_POPCNT_OBJS@
 
 LIBS := -lpgcommon -lpgport $(LIBS)
 
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 40e4b2e381..6baec9549a 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -666,6 +666,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 4d88478c9c..d6a9aee00e 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -321,11 +321,20 @@ extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+#else
 extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+#endif
 extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
 
 #endif							/* TRY_POPCNT_FAST */
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif
+
 /*
  * Returns the number of 1-bits in buf.
  *
diff --git a/src/port/Makefile b/src/port/Makefile
index 366c814bd9..7ecf776069 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS = \
 	$(LIBOBJS) \
 	$(PG_CRC32C_OBJS) \
+	$(PG_POPCNT_OBJS) \
 	bsearch_arg.o \
 	chklocale.o \
 	inet_net_ntop.o \
@@ -92,6 +93,11 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# pg_popcount_sve.o and its _srv.o version need CFLAGS_POPCNT
+pg_popcount_sve.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_sve_shlib.o: CFLAGS+=$(CFLAGS_POPCNT)
+pg_popcount_sve_srv.o: CFLAGS+=$(CFLAGS_POPCNT)
+
 #
 # Shared library versions of object files
 #
diff --git a/src/port/meson.build b/src/port/meson.build
index 83a0632520..7af85c8111 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -91,6 +91,7 @@ replace_funcs_pos = [
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_armv8_choose', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_popcount_sve', 'USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
 
   # loongarch
   ['pg_crc32c_loongarch', 'USE_LOONGARCH_CRC32C'],
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 87f56e82b8..168bf24635 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -125,6 +125,22 @@ uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choo
 uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
 #endif							/* TRY_POPCNT_FAST */
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+
+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (check_sve_support())
+		pg_popcount_optimized = pg_popcount_sve;
+	else
+		pg_popcount_optimized = pg_popcount_slow;
+	return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
+
 #ifdef TRY_POPCNT_FAST
 
 /*
@@ -507,6 +523,7 @@ pg_popcount64(uint64 word)
 	return pg_popcount64_slow(word);
 }
 
+#ifndef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
 /*
  * pg_popcount_optimized
  *		Returns the number of 1-bits in buf
@@ -516,6 +533,7 @@ pg_popcount_optimized(const char *buf, int bytes)
 {
 	return pg_popcount_slow(buf, bytes);
 }
+#endif
 
 /*
  * pg_popcount_masked_optimized
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 0000000000..04a08fbcc3
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  pg_popcount() using SVE population count instruction
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+#include <arm_sve.h>
+
+#include "port/pg_bitutils.h"
+
+// check if sve supported
+int check_sve_support(void)
+{
+	// Read ID_AA64PFR0_EL1 register
+	uint64_t pfr0;
+	__asm__ __volatile__(
+	"mrs %0, ID_AA64PFR0_EL1"
+	: "=r" (pfr0));
+
+	// SVE bits are 32-35
+	return (pfr0 >> 32) & 0xf;
+}
+
+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svuint8_t cnt8 = svdup_u8(0);
+	svuint8_t accum8 = svdup_u8(0), val8;
+	svbool_t pg8True = svptrue_b8(), pg64True, pg;
+	svuint64_t cnt64, accum64, accum64_1, val64;
+
+	int64_t popcount = 0;
+	const char *aligned_buf, *epilogue_buf;
+	int i, prologue_loop_bytes, kernel_loop_sve_cnt, epilogue_loop_bytes;
+
+	// for small buffer sizes (<= 128-bytes), execute 1-byte SVE instructions
+	// for larger buffer sizes (> 128-bytes), execute 1-byte + 8-byte SVE instructions
+	if (bytes <= 128)
+		prologue_loop_bytes = bytes;
+	else
+	{
+		aligned_buf   = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) + sizeof(uint64_t);
+		prologue_loop_bytes   = aligned_buf - buf;
+	}
+
+	for (i = 0; i < prologue_loop_bytes; i += svcntb())
+	{
+		pg = svwhilelt_b8(i, prologue_loop_bytes);
+		val8 = svld1_u8(pg, (uint8_t*)(buf + i));
+		cnt8 = svcnt_u8_x(pg, val8);
+		popcount += svaddv_u8(pg, cnt8);
+	}
+
+	if (bytes > 128)
+	{
+		cnt64 = svdup_u64(0);
+		accum64 = svdup_u64(0);
+		accum64_1 = svdup_u64(0);
+		pg64True = svptrue_b64();
+
+		kernel_loop_sve_cnt   = ((bytes - prologue_loop_bytes) / 8) / svcntd() / 2;
+		epilogue_loop_bytes   = bytes - prologue_loop_bytes - (kernel_loop_sve_cnt * 8 * svcntd() * 2);
+		epilogue_buf  = (const char *) buf + prologue_loop_bytes + (kernel_loop_sve_cnt * 8 * svcntd() * 2);
+
+		/* loop unroll by 2 */
+		for (i = 0; i < kernel_loop_sve_cnt * 2 * (int)svcntd(); i += svcntd() * 2)
+		{
+			cnt64 = svld1_u64(pg64True, (uint64_t*)(aligned_buf + sizeof(uint64_t) * i));
+			val64 = svcnt_u64_m(cnt64, pg64True, cnt64);
+			accum64 = svadd_u64_x(pg64True, val64, accum64);
+			cnt64 = svld1_u64(pg64True, (uint64_t*)(aligned_buf + sizeof(uint64_t) * (i + svcntd())));
+			val64 = svcnt_u64_m(cnt64, pg64True, cnt64);
+			accum64_1 = svadd_u64_x(pg64True, val64, accum64_1);
+		}
+		popcount += svaddv_u64(pg64True, accum64);
+		popcount += svaddv_u64(pg64True, accum64_1);
+
+		accum8 = svdup_u8(0);
+		for (i = 0; i < epilogue_loop_bytes; i += svcntb())
+		{
+			pg = svwhilelt_b8(i, epilogue_loop_bytes);
+			val8 = svld1_u8(pg, (uint8_t*)(epilogue_buf + i));
+			cnt8 = svcnt_u8_z(pg, val8);
+			accum8 = svadd_u8_m(pg8True, cnt8, accum8);
+		}
+		popcount += svaddv_u8(pg8True, accum8);
+	}
+
+	return popcount;
+}
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.45.1

#6Nathan Bossart
nathandbossart@gmail.com
In reply to: Malladi, Rama (#5)
Re: [PATCH] SVE popcount support

On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:

Thank you, Kirill, for the review and the feedback. Please find inline my
reply and an updated patch.

Thanks for the updated patch. I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27). The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code. IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version of
Postgres, which will take some time for significant adoption, and 2) this
is brand new code, so we aren't introducing any regressions for current
users.
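
As a rough illustration of the __attribute__((target("..."))) approach, the sketch below builds a single function with SVE enabled while the rest of the file keeps the default compiler flags. It is not code from this patch or from commits f78667b and 4b03a27. HAVE_SVE_TARGET_ATTR is a hypothetical macro standing in for whatever a configure or meson probe would define, the function names are made up for the example, and the "arch=armv8-a+sve" target string is an assumption that would need checking against the compilers being supported.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Assumption: the compiler accepts a per-function target attribute on
 * AArch64, so only this one function is compiled with SVE enabled.
 * HAVE_SVE_TARGET_ATTR is hypothetical; a build-system probe would set it.
 */
#if defined(__aarch64__) && defined(HAVE_SVE_TARGET_ATTR)
#include <arm_sve.h>

__attribute__((target("arch=armv8-a+sve")))
static uint64_t
popcount_sve_sketch(const unsigned char *buf, size_t bytes)
{
    uint64_t total = 0;

    for (size_t i = 0; i < bytes; i += svcntb())
    {
        /* Predicate that is active only for the bytes still in range. */
        svbool_t  pg = svwhilelt_b8((uint64_t) i, (uint64_t) bytes);
        svuint8_t v = svld1_u8(pg, buf + i);

        /* Per-byte popcount, then a horizontal sum over the active lanes. */
        total += svaddv_u8(pg, svcnt_u8_x(pg, v));
    }
    return total;
}
#endif

/* Portable fallback, always compiled, no special flags needed. */
static uint64_t
popcount_portable(const unsigned char *buf, size_t bytes)
{
    uint64_t total = 0;

    for (size_t i = 0; i < bytes; i++)
    {
        /* Clear the lowest set bit until none remain. */
        for (unsigned char b = buf[i]; b != 0; b &= (unsigned char) (b - 1))
            total++;
    }
    return total;
}
```

With this layout no per-file CFLAGS are needed, but the SVE function must only be called after a runtime check has confirmed that the CPU supports SVE.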

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

#if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
#endif

#ifdef TRY_POPCNT_FAST
...

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?
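
For reference, pg_popcount_masked() returns the number of 1-bits in buf after applying mask to each byte. A portable sketch of that behavior is shown below; the function name is hypothetical and this is not PostgreSQL's implementation, only a baseline that any SVE version would have to match.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of every byte in buf after ANDing it with mask. */
static uint64_t
popcount_masked_sketch(const char *buf, int bytes, uint8_t mask)
{
    uint64_t total = 0;

    for (int i = 0; i < bytes; i++)
    {
        uint8_t b = (uint8_t) buf[i] & mask;

        /* Clear the lowest set bit until none remain. */
        while (b != 0)
        {
            b &= (uint8_t) (b - 1);
            total++;
        }
    }
    return total;
}
```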

+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (check_sve_support())
+		pg_popcount_optimized = pg_popcount_sve;
+	else
+		pg_popcount_optimized = pg_popcount_slow;
+	return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?
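
To sketch what a single chooser could look like: the function pointer starts out pointing at a "choose" routine, which runs the feature check once, repoints the pointer, and forwards the call. The code below is a simplified, self-contained model of that pattern, not the actual contents of pg_bitutils.c. fast_path_available() and popcount_fast_stub() are hypothetical stand-ins for the SVE check and the SVE implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, int bytes);

static uint64_t popcount_choose(const char *buf, int bytes);

/* Starts at the chooser; replaced by a real implementation on first call. */
static popcount_fn popcount_optimized = popcount_choose;

/* Only here so the example can show the chooser runs once. */
static int choose_calls = 0;

/* Portable implementation: clear the lowest set bit until none remain. */
static uint64_t
popcount_slow(const char *buf, int bytes)
{
    uint64_t total = 0;

    for (int i = 0; i < bytes; i++)
        for (unsigned char b = (unsigned char) buf[i]; b != 0;
             b &= (unsigned char) (b - 1))
            total++;
    return total;
}

/* Hypothetical stand-in for an SVE implementation. */
static uint64_t
popcount_fast_stub(const char *buf, int bytes)
{
    return popcount_slow(buf, bytes);
}

/* Hypothetical stand-in for the runtime SVE check. */
static bool
fast_path_available(void)
{
    return false;
}

/* One place that decides which implementation the pointer refers to. */
static void
choose_popcount_functions(void)
{
    choose_calls++;
    if (fast_path_available())
        popcount_optimized = popcount_fast_stub;
    else
        popcount_optimized = popcount_slow;
}

static uint64_t
popcount_choose(const char *buf, int bytes)
{
    choose_popcount_functions();
    return popcount_optimized(buf, bytes);
}
```

Callers always go through popcount_optimized(buf, bytes). Only the first call pays for the feature check, and adding another architecture means adding one more branch in the chooser rather than a second pointer definition.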

+// check if sve supported
+int check_sve_support(void)
+{
+	// Read ID_AA64PFR0_EL1 register
+	uint64_t pfr0;
+	__asm__ __volatile__(
+	"mrs %0, ID_AA64PFR0_EL1"
+	: "=r" (pfr0));
+
+	// SVE bits are 32-35
+	return (pfr0 >> 32) & 0xf;
+}

Is this based on some reference code from a manual that we could cite here?
Or better yet, is it possible to do this without inline assembly (e.g.,
with another intrinsic function)?
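
One alternative that avoids inline assembly on Linux is to ask the kernel through getauxval(), which reports CPU features in the AT_HWCAP bit mask; HWCAP_SVE is the bit for SVE on AArch64. A minimal sketch, assuming a Linux target (the function name sve_available is hypothetical, and other operating systems would need a different check):

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>

/* Fallback in case the libc headers do not provide the constant. */
#ifndef HWCAP_SVE
#define HWCAP_SVE (1 << 22)
#endif
#endif

/*
 * Returns true if the kernel reports SVE support for this process.
 * On anything other than Linux/AArch64 this sketch simply reports false.
 */
static bool
sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
    return false;
#endif
}
```

Unlike reading ID_AA64PFR0_EL1 directly, this reflects what the kernel has actually enabled for user space, and it needs no assembly.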

+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan

#7Devanga.Susmitha@fujitsu.com
Devanga.Susmitha@fujitsu.com
In reply to: Nathan Bossart (#6)
3 attachment(s)
Re: [PATCH] SVE popcount support

Hello,

We would like to contribute our own patch adding SVE support to pg_popcount in this thread. We hope it helps in exploring and refining the popcount implementation further.
Our patch uses the existing infrastructure, i.e. the "choose_popcount_functions" function, to select the popcount implementation for the architecture, so it requires fewer code changes. The patch covers both popcount and popcount masked.
We can compare both patches and work together toward the most efficient implementation for PostgreSQL.

Algorithm Overview:
1. For larger inputs, align the buffers to avoid double loads. For smaller inputs alignment is not necessary and might even degrade the performance.
2. Process the aligned buffer chunk by chunk till the last incomplete chunk.
3. Process the last incomplete chunk.
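
The three steps above can be sketched in portable C, using a 64-bit word as a stand-in for an SVE vector "chunk". This is only an illustration of the control flow, not the code in the attached patch. The names are hypothetical, and ALIGN_THRESHOLD is an assumed cutoff (128 bytes is the value used by the first patch in this thread).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff above which aligning the buffer is worthwhile. */
#define ALIGN_THRESHOLD 128

/* Count set bits by clearing the lowest set bit until none remain. */
static uint64_t
popcount_word(uint64_t w)
{
    uint64_t n = 0;

    for (; w != 0; w &= w - 1)
        n++;
    return n;
}

static uint64_t
popcount_three_phase(const char *buf, size_t bytes)
{
    const unsigned char *p = (const unsigned char *) buf;
    uint64_t total = 0;

    /* Step 1: for larger inputs only, consume bytes until p is aligned. */
    if (bytes > ALIGN_THRESHOLD)
    {
        while (bytes > 0 && ((uintptr_t) p % sizeof(uint64_t)) != 0)
        {
            total += popcount_word(*p++);
            bytes--;
        }
    }

    /* Step 2: process whole chunks until less than one chunk is left. */
    while (bytes >= sizeof(uint64_t))
    {
        uint64_t chunk;

        /* memcpy keeps this safe when small inputs skipped the alignment. */
        memcpy(&chunk, p, sizeof(chunk));
        total += popcount_word(chunk);
        p += sizeof(chunk);
        bytes -= sizeof(chunk);
    }

    /* Step 3: process the last incomplete chunk byte by byte. */
    while (bytes > 0)
    {
        total += popcount_word(*p++);
        bytes--;
    }
    return total;
}
```

In an SVE version, steps 1 and 3 would typically use a predicated load so that partial chunks are handled without a scalar loop.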

Our setup:
Machine: AWS EC2 c7g.8xlarge (32 vCPUs, 64 GB RAM)
OS: Ubuntu 22.04.5 LTS
GCC: 11.4

Benchmark and Result:
We used the popcount test module recommended by the PostgreSQL community [0] for benchmarking and observed a speedup of more than 4x for larger buffers. For small inputs of 8 and 16 bytes, we observed no performance degradation.

Looking forward to your thoughts!

Thanks & Regards,
Susmitha Devanga.

________________________________
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Wednesday, December 4, 2024 21:37
To: Malladi, Rama <ramamalladi@hotmail.com>
Cc: Kirill Reshke <reshkekirill@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: [PATCH] SVE popcount support

Attachments:

v1-0001-SVE-support-for-popcount-and-popcount-masked.patch (application/octet-stream)
From 5f8f3ca9b8372ad44f1c2cc6f1a4542359ae8142 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Mon, 9 Dec 2024 01:41:50 +0530
Subject: [PATCH v1] SVE support for popcount and popcount masked

---
 config/c-compiler.m4              |  41 ++++++++++
 configure                         |  93 +++++++++++++++++++++
 configure.ac                      |  16 ++++
 meson.build                       |  31 +++++++
 src/Makefile.global.in            |   4 +
 src/include/pg_config.h.in        |   3 +
 src/include/port/pg_bitutils.h    |  14 ++++
 src/makefiles/meson.build         |   3 +-
 src/port/Makefile                 |  11 +++
 src/port/meson.build              |   4 +-
 src/port/pg_bitutils.c            |  10 ++-
 src/port/pg_popcount_sve.c        | 132 ++++++++++++++++++++++++++++++
 src/port/pg_popcount_sve_choose.c |  32 ++++++++
 13 files changed, 391 insertions(+), 3 deletions(-)
 create mode 100644 src/port/pg_popcount_sve.c
 create mode 100644 src/port/pg_popcount_sve_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index e112fd45d4..eabe68a773 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,44 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_POPCNT_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE popcount instructions using the
+# svdup_u64, svptrue_b64, svcnt_z, svcnt_x, svadd_x, svaddv, and svwhilelt_b8
+# intrinsic functions.
+#
+# Optional compiler flags can be passed as arguments (e.g., -march=armv8-a+sve).
+AC_DEFUN([PGAC_ARM_SVE_POPCNT_INTRINSICS],
+[
+  AC_CACHE_CHECK([for svdup_u64 and other intrinsics with CFLAGS=$1],
+                 [pgac_cv_arm_sve_popcnt_intrinsics],
+  [
+    pgac_save_CFLAGS=$CFLAGS
+    CFLAGS="$pgac_save_CFLAGS $1"
+
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>],
+    [svbool_t predicate = svptrue_b64();
+     svuint64_t segment = svdup_u64(0), accum = svdup_u64(0);
+     const char *buf = NULL; // Simulating a buffer pointer
+     uint32_t num_vals_segment = svlen_u64(segment); // 64-bit values per segment
+
+     // Using other intrinsics as per the updated code
+     predicate = svwhilelt_b8(0, 128); // Simulating a conditional predicate
+     segment = svld1(predicate, (const uint64_t *)buf);
+     accum = svadd_x(predicate, accum, svcnt_x(predicate, segment));
+     uint64_t popcnt = svaddv(predicate, accum);
+
+     /* Return computed value, to prevent the above being optimized away */
+     return popcnt;])],
+    [pgac_cv_arm_sve_popcnt_intrinsics=yes],
+    [pgac_cv_arm_sve_popcnt_intrinsics=no])
+
+    CFLAGS="$pgac_save_CFLAGS"
+  ])
+
+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    CFLAGS_POPCNT_ARM="$1"
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi
+])
diff --git a/configure b/configure
index 518c33b73a..a3e41459d5 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,8 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS_ARM
+CFLAGS_POPCNT_ARM
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17159,6 +17161,97 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+CFLAGS_POPCNT_ARM=""
+PG_POPCNT_OBJS_ARM=""
+
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_u64 with CFLAGS=" >&5
+$as_echo_n "checking for svcnt_u64 with CFLAGS=... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+  CFLAGS="$pgac_save_CFLAGS "
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+int
+main ()
+{
+    svbool_t predicate = svptrue_b64();
+    svuint64_t segment, accum = svdup_u64(0);
+    uint64_t numVals = svlen_u64(segment); // 64-bit count check
+
+    svuint64_t counts = svcnt_u64_z(predicate, segment);
+    accum = svadd_u64_m(predicate, accum, counts);
+    return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics_=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics_" = x"yes"; then
+  CFLAGS_POPCNT_ARM=""
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_arm_sve_popcnt_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_u64 with CFLAGS=-march=armv8-a+sve" >&5
+$as_echo_n "checking for svcnt_u64 with CFLAGS=-march=armv8-a+sve... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+  CFLAGS="$pgac_save_CFLAGS -march=armv8-a+sve"
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+int
+main ()
+{
+    svbool_t predicate = svptrue_b64();
+    svuint64_t segment, accum = svdup_u64(0);
+    uint64_t numVals = svlen_u64(segment); // 64-bit count check
+
+    svuint64_t counts = svcnt_u64_z(predicate, segment);
+    accum = svadd_u64_m(predicate, accum, counts);
+    return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve" = x"yes"; then
+  CFLAGS_POPCNT_ARM="-march=armv8-a+sve"
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS_ARM="pg_popcount_sve.o pg_popcount_sve_choose.o"
+
+  $as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index 247ae97fa4..1ea314190b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,22 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM popcount intrinsics
+CFLAGS_POPCNT_ARM=""
+PG_POPCNT_OBJS_ARM=""
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_POPCNT_INTRINSICS([])
+  if test x"$pgac_arm_sve_popcnt_intrinsics" != x"yes"; then
+    PGAC_ARM_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+  fi
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+    PG_POPCNT_OBJS_ARM="pg_popcount_sve.o pg_popcount_sve_choose.o"
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+  fi
+fi
+AC_SUBST(CFLAGS_POPCNT_ARM)
+AC_SUBST(PG_POPCNT_OBJS_ARM)
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index e5ce437a5c..6c936f2f2b 100644
--- a/meson.build
+++ b/meson.build
@@ -2191,6 +2191,37 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of ARM SVE popcount intrinsics.
+###############################################################
+
+cflags_popcnt_arm = []
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+int main(void)
+{
+    svbool_t pred = svptrue_b64();
+    svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
+    /* return computed value, to prevent the above being optimized away */
+    return svaddv(pred, svcnt_x(pred, val)) == 0;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE popcount without -march=armv8-a+sve',
+        args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  elif cc.links(prog, name: 'ARM SVE popcount with -march=armv8-a+sve',
+        args: test_c_args + ['-march=armv8-a+sve'])
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+    cflags_popcnt_arm += ['-march=armv8-a+sve']
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index eac3d00121..2c32dfab5e 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,6 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
 CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
 CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
 CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT_ARM = @CFLAGS_POPCNT_ARM@
 CFLAGS_CRC = @CFLAGS_CRC@
 PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
 PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
@@ -770,6 +771,9 @@ LIBOBJS = @LIBOBJS@
 # files needed for the chosen CRC-32C implementation
 PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
 
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS_ARM = @PG_POPCNT_OBJS_ARM@
+
 LIBS := -lpgcommon -lpgport $(LIBS)
 
 # to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..29c32bbbbe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a3cad46afe..57ebfddb7d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,14 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On AArch64 builds, try using SVE popcount instructions, but only if
+ * we can verify that the CPU supports it via a runtime check.
+ */
+#if defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+#define TRY_POPCNT_FAST 1
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
@@ -317,6 +325,12 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_sve_available(void);
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index aba7411a1b..c0207426c2 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -102,6 +102,7 @@ pgxs_kv = {
     ' '.join(cflags_no_missing_var_decls),
 
   'CFLAGS_CRC': ' '.join(cflags_crc),
+  'CFLAGS_POPCNT_ARM': ' '.join(cflags_popcnt_arm),
   'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
   'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
 
@@ -179,7 +180,7 @@ pgxs_empty = [
   'WANTED_LANGUAGES',
 
   # Not needed because we don't build the server / PLs with the generated makefile
-  'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+  'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'PG_POPCNT_OBJS_ARM', 'TAS',
   'PG_TEST_EXTRA',
   'DTRACEFLAGS', # only server has dtrace probes
 
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..2e04ea4d5a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS = \
 	$(LIBOBJS) \
 	$(PG_CRC32C_OBJS) \
+	$(PG_POPCNT_OBJS_ARM) \
 	bsearch_arg.o \
 	chklocale.o \
 	inet_net_ntop.o \
@@ -87,6 +88,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_popcount_sve.o need CFLAGS_POPCNT_ARM
+pg_popcount_sve.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_shlib.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_srv.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+
+# all versions of pg_popcount_sve_choose.o need CFLAGS_POPCNT_ARM
+pg_popcount_sve_choose.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_choose_shlib.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_choose_srv.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+
 #
 # Shared library versions of object files
 #
diff --git a/src/port/meson.build b/src/port/meson.build
index c5bceed9cd..21d686a26e 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -91,6 +91,8 @@ replace_funcs_pos = [
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_armv8_choose', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_popcount_sve', 'USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+  ['pg_popcount_sve_choose', 'USE_SVE_POPCNT_WITH_RUNTIME_CHECK'],
 
   # loongarch
   ['pg_crc32c_loongarch', 'USE_LOONGARCH_CRC32C'],
@@ -99,7 +101,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
 ]
 
-pgport_cflags = {'crc': cflags_crc}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt + cflags_popcnt_arm, 'xsave': cflags_xsave}
 pgport_sources_cflags = {'crc': []}
 
 foreach f : replace_funcs_neg
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index c8399981ee..6b2e6b3794 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -135,7 +135,9 @@ pg_popcount_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
 
-#if defined(HAVE__GET_CPUID)
+#if defined(__aarch64__)
+	return false;						/* cpuid is not available on AArch64 */
+#elif defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
@@ -176,6 +178,12 @@ choose_popcount_functions(void)
 		pg_popcount_optimized = pg_popcount_avx512;
 		pg_popcount_masked_optimized = pg_popcount_masked_avx512;
 	}
+#elif defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
 #endif
 }
 
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 0000000000..8c9ebfc3aa
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,132 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  Holds the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#include <arm_sve.h>
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+/*
+ * pg_popcount_sve
+ *		Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svbool_t    pred;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64      popcnt = 0;
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * Therefore, we align the buffers only when the input size is sufficiently large.
+	 */
+	if (bytes > 4 * vec_len)
+	{
+		pre_align = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		popcnt = svaddv(pred, svcnt_z(pred, svld1(pred, (const uint8 *) buf)));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process two complete vectors per iteration */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svld1(pred, (const uint64 *) (buf + i));
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svld1(pred, (const uint64 *) (buf + i + vec_len));
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));	/* reduce the accumulators */
+
+	/* Process the remaining bytes, one (possibly partial) vector at a time */
+	for (; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		popcnt += svaddv(pred, svcnt_z(pred, svld1(pred, (const uint8 *) (buf + i))));
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	svbool_t	pred;
+	svuint8_t   vec8;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * Therefore, we align the buffers only when the input size is sufficiently large.
+	 */
+	if (bytes > 4 * vec_len)
+	{
+		pre_align = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf) + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		vec8 = svand_n_u8_m(pred, svld1(pred, (const uint8 *) buf), mask);  /* load and mask */
+		popcnt = svaddv(pred, svcnt_z(pred, vec8));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process two complete vectors per iteration */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_n_u64_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_n_u64_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));	/* reduce the accumulators */
+
+	/* Process the remaining bytes, one (possibly partial) vector at a time */
+	for (; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		vec8 = svand_n_u8_m(pred, svld1(pred, (const uint8 *) (buf + i)), mask);
+		popcnt += svaddv(pred, svcnt_z(pred, vec8));
+	}
+
+	return popcnt;
+}
+
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
diff --git a/src/port/pg_popcount_sve_choose.c b/src/port/pg_popcount_sve_choose.c
new file mode 100644
index 0000000000..5f4e164f9c
--- /dev/null
+++ b/src/port/pg_popcount_sve_choose.c
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve_choose.c
+ *    Test whether we can use the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    src/port/pg_popcount_sve_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#include <asm/hwcap.h>
+#include <sys/auxv.h>
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+	unsigned long hwcap = getauxval(AT_HWCAP); /* get the HWCAP flags */
+	return (hwcap & HWCAP_SVE) != 0; /* return true if SVE is supported */
+}
+
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.34.1
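
For anyone reviewing without SVE hardware: the masking arithmetic in pg_popcount_masked_sve (the mask64 = ~UINT64CONST(0) / 0xFF * mask broadcast, and the split into a whole-word main loop plus a byte tail) can be cross-checked in portable C. What follows is only a sketch under stated assumptions, not part of the patch. The helper names broadcast_mask, popcount_masked_ref and popcount_masked_words are hypothetical, the word loop uses 8-byte words rather than SVE vectors, and __builtin_popcount/__builtin_popcountll assume GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: replicate an 8-bit mask into every byte of a word. */
static uint64_t
broadcast_mask(uint8_t mask)
{
	/* ~0 / 0xFF is 0x0101010101010101; multiplying copies mask to each byte */
	return ~UINT64_C(0) / 0xFF * mask;
}

/* Hypothetical reference: popcount of every byte ANDed with mask. */
static uint64_t
popcount_masked_ref(const char *buf, size_t bytes, uint8_t mask)
{
	uint64_t	popcnt = 0;

	for (size_t i = 0; i < bytes; i++)
		popcnt += (uint64_t) __builtin_popcount((unsigned char) buf[i] & mask);
	return popcnt;
}

/* Hypothetical word-at-a-time variant: main loop on whole words, then a tail. */
static uint64_t
popcount_masked_words(const char *buf, size_t bytes, uint8_t mask)
{
	uint64_t	mask64 = broadcast_mask(mask);
	uint64_t	popcnt = 0;
	size_t		i = 0;
	size_t		loop_bytes = bytes & ~(sizeof(uint64_t) - 1);

	for (; i < loop_bytes; i += sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf + i, sizeof(word));	/* safe for unaligned buf */
		popcnt += (uint64_t) __builtin_popcountll(word & mask64);
	}

	/* Remaining bytes that do not fill a whole word */
	for (; i < bytes; i++)
		popcnt += (uint64_t) __builtin_popcount((unsigned char) buf[i] & mask);

	return popcnt;
}
```

Comparing the two functions over a range of lengths and masks is a cheap way to convince oneself that masking whole words with the broadcast value is equivalent to masking each byte.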

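The two index computations in pg_popcount_sve.c are easy to misread, so here is a small portable sketch of them. It is an illustration only: pre_align_bytes and main_loop_bytes are hypothetical names, and the second function assumes that the vector length in bytes (what svcntb() returns) is a power of two, which the bit-mask form in the patch also relies on.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the pre_align computation: the number of
 * leading bytes to consume so that the next address is 8-byte aligned.
 * The result is in the range 1..8, so an already aligned address yields 8.
 */
static uint32_t
pre_align_bytes(uintptr_t addr)
{
	uintptr_t	aligned_down = addr & ~(uintptr_t) (sizeof(uint64_t) - 1);

	return (uint32_t) (aligned_down + sizeof(uint64_t) - addr);
}

/*
 * Hypothetical helper mirroring loop_bytes: the largest multiple of
 * 2 * vec_len that does not exceed bytes.  Only valid when vec_len is a
 * power of two.
 */
static uint32_t
main_loop_bytes(uint32_t bytes, uint32_t vec_len)
{
	return bytes & ~(vec_len * 2 - 1);
}
```

With a 16-byte vector, for example, a 100-byte input spends 96 bytes in the unrolled loop and leaves 4 bytes for the predicated tail.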
benchmarking-1.pngimage/png; name=benchmarking-1.pngDownload
�PNG


IHDR`�?sRGB���gAMA���a@PLTE������������������������������������������������������xxxYYYnnn���~~~iii���������ccc������������sss���^^^�������������������������X�s1��K��r�����������D�w7�����^�������s1�������t2����r0�������s1��a�������������������v6��k�����������K�������z<��q�{=��{�����������O����|>�����Q�����x�����������e����������z;��k������o�wD�Ml$5@a�i���������.z9q+���K�T������m%����C���l$���~��Y�a�����������X��t������r�x������*v5g�n��������M�x9���x�~6}@������0z;m�s���&t14~?z�����������:�D���<�F'u2���h�p���������R�[���%t1#s/���Y�b���v�}������������K�T%s0d�l���������H�Q"q-a�i���������������J���tRNSH��`�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������x���2CA	pHYs2�2�(dZ�z�IDATx^���������=��&�����Ht��FY�|{�u<�l_r:�Y%�(���p�Y����9��)�V�%�s��UO=�@� ��.��_���A���*����?�w������[��r��o�/��O��D�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?k���tpQ�~[1�	_(�"<;CkW���L�����L�����L.��:<J�����������������N\�����R�7��/�/��������+;x)���e�d�����\��?�����������?�s)�nuLl���b(��t,�/����������>��:�:_c5�nl�a��|w&���d8<���%��BG;^#��MZ�`j+�Wc��o�#��{�O�$�8q��
uxy�Oa �X����B�AG2���������������I�~vr��Y[hW��c�����"<��/T�&U?�ex���J��"��O�Q�$Iu\H�h�#9p�r������E%�����:���5���������28�7)?��}�.K��r����ym~������Y�>�I���/��)dw��J�RndY���J&����{����?��I��u��B��L��^�����Gg5����;�K/�6y3^��t�n����_���T���b��l�����w�g����]b�k
�;����c'�x�(]_�G���QM����p�R�i�.�L]�����+^�����S-y����N5����s���;^x�d��txE��m2�O������/k�P�"�L���}��*;��������OY�����?9i~�!h���06��
/�H�;�]O�~��
�Y����t[�&K]�~:y+^
��T���^��B]����*����?�.�]n����������m~l��NB%,�����,�$g��x��Z�����4�z���g���J�;��J���_�r�p��I���B]5����z���LG6>d'����$����������Tj�]��{vBt���GvrpH�hVHn��T�&u�#%�<}�$re��V��w���pU��m Wc.tl)u_�,y�j�:*f����#+���(�OY�����?)�{[��<��J���6�<��hP�IR��r��C�D_���]���T��t���N��	�c�h���DBX��f��z���/�uO(����!����;�
7�S���q�O�^+���=�JHI����������A�OqU9����{�B��cWU�z���*@y��n���dXt��6c���0�/>���9x��qP,��2���7�iiX�?���s���:����*��2=^�{A�P����*��,��u
p�����c}������Nzc�����s���{��u����~�`���nyyM�U��������,%�� �����6����]�}��_�/`O0����Y��]��;l��c��%����xbwA��X�K-9l��?�^*A�)�'�"���Y���Z��
����/�������0v�b����M�r���v�-�]�)x%�h_}�q�������C���j��������/T��r��|��
�p����|q�e���oJ�8���:L�X�?��u:��E�Y�Dy�]��-�A��������}@�d��U>��t�,���_�n������,�v}�������#����W
�Nvk���%�X�����VW���\l���tuC��S�r���^Y�����9R������vg�v�.����
o��I��V~0y�pg?x)�i�M�������p�.O�E�����m{E���=M��J�����	��|�*o��P����5��v��K)���2�W;.Ur�-����a���G�()la,����j ^id++�D���]
v9�*+4U\#�"�%;�$�S�'{���P>��A�$�TxI=BKi;X�?���T���������WB��<�v�����|3E�-�?u_\�4��bv� ~�����
�����&�Bu��7��?�����.���]��q�)�����E��z��_���H
����N���������8+Zd4�k����'w���_n��$��{��S���kG�r �I�aV�KU.�� '=�����Ym�;�H������Zvz��Kqr|n�J������ky��]��J�p��I���/N��'�{W��k�/���V�l�~?����>c�d��Qok_�X�Z���M�;�=�z��7������b�����>����zo�Xh�����������������`���a�������k�����'{e��������������-��g9��
���R�}W��������������� ��i�A�?BFuX����:������`G�}�Y+�k���\��`e!T���f
���E����9���_��+y��=��_f���?��n�|�	^���N+��p�%^���`���Fmy,3���P�
��Y��D.,'^�.���]�Dn������P�_~�����2��x��#��+R�*�7�I����cK1���@��.�KOu���8�C�\��9��+ZY���.��3o�H���_�|���e_��?��-���Gq����%)>&:2���6�,��u��o��f�qQ�-�?��8����qR�-��\�p��Yk\�?�<�mKi]��_;�o�y�~M��o?�������������}�}>���/�?�vXyY���!����R,xg��#q^��z�;��P^/����%���:����*��"s&�E������rL��H;����������W�+��U;�����b�!=_]��jZ�*��[s�F�}���g�'�oE%N*K�C���c��ry�}B9��������������m�u���*�y���������~���V�n�����VFu/��rf����Q�R�e$,�.-�Y��JS�W���O�����\��HE� A.��^��z��C��#ys�%��x����=9���RR����I�����t�����+lO�fK��B���UD\�{�d{����h���I(x��mV�����,�B����<�/<��!�s�����m���	�t�U������/r(f������o�+3�0^�#�����r��oP�h��S9v������-!C�R���}E1��q��,�)g�z��?9���)������^������v$�n S�2�����XR��
�b�������������/N�����'��O��<��Wv�"!������������d����������gd�=��u�_0���2���?�4�%����n*]{�/{J�;�4�d�������]*�W�bVn�f�����9�"��|z�E���3j9�����y�#�A^����_�	?����B"�r�Y?j�|��b��R_��6��#T�?��8i*^[��VXo����G��^U�����-�m���	�=��l	O�_*P~�D�n����Vf�a����j~�,,gU����n��lV[`�/GBp,[���s_��$�Px\Hw[OvM�9E�Qgq�&�R_
��p�#��dT�-�|�������2C�a�x��.EQ�_\V�����^<��)K���/�-�d�,u=����G>p������[<���%�G�����S��`�
�'���?%J;����x��%�mQRI}���������%�����n��5F���s���J��l��}�=��e$<���2H���3���x��}�[������-u$#������������������B��Kx�V��|���?
O���/N��������<X�� d��)����j�iT��o�}�O�_�0{�j�_���;����x��!JQ.k'���b�5�x�t�w��3��F�����?UeVxV�q�Y0ls���,Rg@���P,KFu��.��\Y{��{8�����d�d�Ct4�������P��[�',�f����y>"��|�����Gf���� �U~fdGf�v�`�����[�NX&\�7�)���J��]~�����2���?�4���E�I�5��Q�N���?�b��[��O�<W7z��|Y� ,C�O]�l_�]��d�H;R�y��+�����,�aK��<�g���\r)�c�������R�D�,1�&�L�������G~������ d�J����fm���x�D����&������L/����]���g�]��CZ-N�|���4O��Sj7v(+0�_�|!s�I��){Ek��6������~�)�9�N���V>�`Z/�v���`��e��[A�����������s����,��,pW|�Ri���R�;�,�?��8�F�~>��[";U�������m���������ez�i`�v����!�u���'�a�������T�;�����w�J���*j��� ���[���!���
��A��,�^�0��y��}�[����%�u��'������+),���)���n�q�$�4<#X*�|q�@�rN��j���3���������_"d���U�~�D;��
��'6��'V�8)�la\��k�v����th��s+�SrG��Y-^�Zu�RM9r�y)�I>�������������dT��5����j�|��=������6��E�%����3��s����R� ����}�?�f���������D���K����ZO���0�
��O�?��V�����c�����a����v��-��*��E����8���d$X�+���0\%A^��#�|�e��[R6�g�����=���_I�>DfK�Z����2��JR�v�Q�O��T����rwyo�:U��9��A0R���o�����������S��B����?O�E1�3��-h��#�����we����Os���'Nr��x���12�A����i)�Yr�������(�:lM�N����+��n�5�'��B��"Wm�{)�6����_�o�%P)��H����A�������&���/�?S����_�0�A����]��Vf�a�����zV��R�k�����|�G���\�QL���J�#�p'{A�,%ml�E��s~T����7��P�)���������{3��[O���;�%7�{	���`/��J��~����2����_\H�����R<��S���P� ���&������t�n',�?���
?e�^����.���<���~���%1d��/_������O(N�ejE}f�d=�B)��'��?�T��'Y���"�^#�����v�� �:l�����$E^V���%���;���H���g�/T*-��'*�+%����/�Dv��n%!���� \���T�3��n��@���n�)�K��1�3a�����C��r��(��
pr2|<u�'�dC��?���Sf���L���~a���?��yG��a3S9d�;���]����2��B�*a�/�f�����9�n�qS�aA�����t�/���3��+�s�r�P����� �������^l��m�[�#���G���z���pAeiZ��k�w�k�:�@�H�Ixi��=���S��<rH�G?=�������f oW�4YA���)���GP�M%���"���A	n'x3�g�?��a�P���T���>��s�#��i9��-����:����&�d����Jf�4N����DW���0��N���x���.��
CE�\���}��x��6���R���"9��Y��-�v}kvXc��c��9�
�b�\V��.���	�L�b���.�:����#�\����]��(�d�;����Cv$��������j����M)��+����2O���p���,��������z���PY�w#P�����_\j/E��D����6�l��Mp$�o��I�n�A�s�fy�m_/���m/K�?i_j����
CpEej�����t��}������[M�d�S������a������|��&��+��hq�.�L��)��m��!%I�KW�����|��7�aS���Q1U��>���������!%�la�v�7�T��/BV�?�%�����_�8��
����Zn[��_���#�L5�-�?�_�DhQd�*1[��?r�z���2��"�|�-tR��s;V�O��������OEy��}?����|<������2o���W+�S~��n�u�[��s������|�	�^������+��v�#?5�YR�������KN���K���J��#���9�H����L��R�9�C{jY��A��pS��_t�csg��x���� O�~��*e���T��e�����l��>����sVJ��F���kQ���t�����6��"j��l�I^{�s{�W��[�V���Y^�R3��~���v^@��b����������ak������s��C���V�\u��.���rN�$g�Yi������O~��<d����5:�d�II���s������,=?4�����>�_�B��>�2�T�J��kSL������S�q����4k*�%����g����/��R�$tg0�f��,�6�����y�=J���'*���FFu���������m��8.����
������lN����b�^�Dy���=Y�W��Y~����9;l��c���?�	�j�s�;E���k�y��(bR���y���I)���\������4=��c��d�I����N�Iv!�rp��i�Q��O�3��t8�Y�;�LN���^��K����p����]����=�ba�~�����=��E�fV���87������G�8�F��V��42�/�������<^�l��0�8����{Y�u�����=n2<L�Rt?�7P����&���b(�K����&��S�~ev��e����������y�K��k�����a��?V����N^�&����[�]��*������o��.����i��nJ��$����)]�p��>�[�82��Q�_������0�w�2�pvd������U��X:f|qE+s��AZ�q7� �ljO���5��\�V&��Q��]�U�aP*��U��*���L.�Bx:���2��IXf���_~����y;�����u=��n���	�hrG�tY���F���=3��?�d:��Pr��k%218��#�3&/��';�������@���W����[e���ykj�f��"�B����XK���/njA��J����#������.	AI�]���5�):l�n{�����R��+�J�o�b����H�'L�E����N���|�Y~�����;�����u<\���l����_��V���PU��uP`��O]��2�����+�29,���yh?�s�,����h-����0���~h��TX�����`7\�Rq����Pm����|�Tqf��~��:�:L�������,2&�����R�.p�2M�E������{[T�������y�@�>��d�o�J��U���^�q9��9�7����0����X�����\�>%�kEw%�a�n������}�������H������n�'���{��,��'�WrK����_>O8�@��|/K�r+��}m��N���,��N)}�����i2o���"_�{Nqym
�r�J��<�_�5�LO���P�	�������������F�?��l���[L�U���_��K�������m&�@�>`I�,C;7�=��[`�������Y��r��8s�b���������������">���@��?���x��6��E<;>���
�`���g/mC�X�����@��?���x��6��E<;>���
�`���g/mC�X�����@��?���x��6��E��?����!1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b �����:8����C@�?h����!M)��e��(q�O�f~�����m<|��D�|�:���&I�L�M������u���_W&�����LN����|�:���&�����~�L7����Ml{�����������A�T�OU�}K���^?�o\����B��T"�~��DJ��\�1�)��Jj@�(^�ky�d|��m��Z���N��������|����"�~���\���	I2��g�%��:�&�7yX^���?�{:n����]XQ"�~��d���qe���>�,H����������
nt��v��K_&$�:��
e���
��G��I*J]Wb���s��tx��{�:�$��vtx{��<��^o��S��\�9?�����&�*u]�
� �rtxW�yb�He�ZG�\*��-�^�
WR�;�*^����A�T��wv�T��	d9:\M�����>gq]GH������$��t��g_�*J�:��A�T���A,�3<�����$}���Nzc^�H��
���]Lb�4H�~�>���Y���\��^z2��&������n�c�5��Q��~|��M�}����r-p�'&�~�������=�F���������,^�diP�1�H�����$������3��$�Q���y�������*x%���`$K:�|$�k�+��A��?h��RW.�����*���>[`��&���G���#/���y�]y!�������n�PY�Q#��t�H
J�<��G��I*K]	�lb��/"F������iV���������f6;>��)��g+���]P���v��'��!=�_���. �$�����}�v\����w/d8)h WQ&7���[���l,�2����ds/������*�m�����`r�6N�����{�hQ�q��@�������6G����4Ie�+9�?E �����;R�8�Kt�`h�����W��sv,��6�%gI�7������C?��#o�Y��<���:�%G3�-�g��L[��y��?h?�MRU���*�2��h�r�mi�*D�7�:R�?ri�h@��Nv�����������v�x�N����R'����?�7�Uq�i�?h?�MRU���i���@*,�|y��$�j#MhE���?��������o
����cGL��eR��:<��(� ��o*����~`H�(OC��G��I*J]w���bd8��"�vv������m;�����?�^�7�I}(�>���C�{��i�R�)e��K����� �A{���ny���A��?h��RwO�<��Y^��Z�	F�7��E?��������_/�#ag5y�`�D��p�K�����+eI:�m.�fA���4I�����.n�TO�"������X�DR��}�������!��[��H�g���^m������R����+�����9G��u'�~��DJ������8s1�.��-b���i�.��
tyQ����c���\�)=�M�`�?07�->�Q�z���gf���6'�@��I\���i3�������������>�-e�#���U���W��H�?2������T�ft��D0!�5�������T���z����A�T�O^��(��B�jC� ����A-yA����r-)��/E���vi(H�T]q]j���CMCw����SRk����A��?h�����
V��U�?y+��������:,l�YM�k�����.mW7�A?rM�7�<�W�����&�Rw���c]��k9�?�o0�$q��H��������L�aQZ�����?}N��V�� w#��2�����A��?h�����ZD�S�8��+B���S&��H��u����WK����G������~�U3N�J�?�u�����A��?h����"��@
jy���;��w}����u�����q���_eoG���u��v�wve����A��/u�����;A?�#�:��C���������L���������-�=.��(�jW���?h?�M2�������v	4��{���T�r���&�����Pf��:��,�u�����G:B���4��R������qa�jv����<L�������_;�%�������u�v�+�N�;y�x���#�$�K����I�"�`dG�����\������|Gf��D�O��.�p���������������� �~������na��D3��#�ky��	��vN��8���q=��.�����n��?r�}U�4��~���A��?h��RW^�o���V���'���n��]��������)��BI�������:����T#]�TX�� �������+0��'�W`���T��+62�W���l��)v�0��WG1��[����>����4��-u=�^v���^�a����&�+u%B���c-��V �+��w�=������I����7��W��q?��?�M�D[��
<��)�U�n�����z}i}�����&�+u���W������������vy�MW���o�s5����^�s��%�b�T�~����$��?[v��y=\7�$�����s��[�����&�-u��g��G)������IrQ�(�������+�=��ET����y����"��Mnz/�Y��-�9n�7�v�=��%�I��\��8�,J��d���6\�'9Mw�{���x�!w����cF��������=����������Q����cl��x�o�F�}E�E>����H�r�lg�w6<I��!���WY��O/6��&���,
c����GT����=6V�[��,�ua��r}#O���.���W����wB����?h?�MR�?�����]�)���%��q���g������4���"���j�����x�e'��d&���1���w�
GSp �~��d��1����2]n�?�f�G��?�������0�]�[�-m���7��������:H�����u�� �.*�@����4����>�����t8�H��7���ezb��y��	n}����c�'��u?���@f���X���J�v���d8<��|��q=�6���W�9�T����t����P~��^��f�r����7-G����^�r���-�k&��}�n �U5�4�e��
��s��/�.#@����7��0K�#������z|�o/��D��t�<��t�E�	�8]7����N �U�q��)���A������=���i@'�?�,�8��Alw�rW���V��`�t����?�m��__����sn�Y������N �]#��K�|��~������@���.KOs;�~�z�?�����ir��Cn�^�����@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� �f�O:e������L��W�D�bhF�h��:O�	�o�n����C��� ���?�4�@�50L��Am������m�U����tA����tB����tB����tC����tC���Vvtx�����-�f��I�c�?G:���a����T�tx!�4���������jT��M���6U����tF����tG����tG���V�u�r
�S�9�0����?T�K��?T�K�?T�S�?���������S�@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ������@�b 1�?��@� ����I�s��^��:h�����y�<l�@{5&v&?��N�VS��o�g�����c3��S��w������=�}��?�SZ�S+��F�M��#�:;2tu�$�2�D�������i�����`�����~?��N�$W:�0���[���Guz+uje�6����!�c�<7:���&��6���S����+-����`��f�����4������k6��������m�ym3��2I�u�����{��q���?�/�M�Vh��;h��?�I�����A��;�I}�e:��@l�A�������~�o��*���C�T3J�F�����$����������Nm����,�3�o'��~fV	�����s�$���Su>��M��r�����V��{���0�J�&����9&���4M��sx����QY����o'���9@3�,�|�6#��o������^�3��~����Y���o'���{���_������(-��?������\^X3��T~&�:��^�����/����������h��������t��{E�����iF��T��$���:/�?�l��������24�>�-�Sc����
�`��������nmm�x��������@4��?���!9����:��-�q���z�$�6�G���R��������k���@��g�`���:w5#��d���}�6��uk:>D��/>�����A5#F��S���M#��Q��e�,�g���O����
������1��&�P��{h����Y���p�|�l��*�����^�j���y��a%TC�����z \���7�'����5������-S���A��X�����!��;7�srg����Lk�/�V�}E_k�����V"X����g�__�K��������d8��$c	��������JU��G?�^�Dk������+^��g6��jJ��6y���Nk��5��?��[�ze?��5�O4+���|Q�J��]�;s�VB5&z���3I��A<
benchmarking-2.png (image/png)
#8Malladi, Rama
ramamalladi@hotmail.com
In reply to: Devanga.Susmitha@fujitsu.com (#7)
1 attachment(s)
Re: [PATCH] SVE popcount support

On 12/9/24 12:21 AM, Devanga.Susmitha@fujitsu.com wrote:

Hello,

We are sharing our patch for pg_popcount with SVE support as a
contribution from our side in this thread. We hope this contribution
will help in exploring and refining the popcount implementation further.
Our patch uses the existing infrastructure, i.e. the
"choose_popcount_functions" method, to determine the correct popcount
implementation based on the architecture, thereby requiring fewer code
changes. The patch also includes implementations for popcount and
popcount masked.
We can reference both solutions and work together toward achieving the
most efficient and effective implementation for PostgreSQL.

Thanks for the patch; it looks good. I will review the full patch in
the next couple of days. One observation: the patch adds `xsave` flags,
which aren't needed:

`pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt +
cflags_popcnt_arm, 'xsave': cflags_xsave}`

*Algorithm Overview:*
1. For larger inputs, align the buffers to avoid double loads. For
smaller inputs alignment is not necessary and might even degrade the
performance.
2. Process the aligned buffer chunk by chunk till the last incomplete
chunk.
3. Process the last incomplete chunk.
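
The three steps above can be sketched in portable C. This is only an illustration of the align / whole-chunk / tail structure, not the code from either patch: 8-byte words stand in for SVE vectors, and the names `popcount_chunked`, `popcount_byte`, `popcount_word` and the `ALIGN_THRESHOLD` value are assumptions made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-off: inputs shorter than this skip the alignment step. */
#define ALIGN_THRESHOLD 64

/* Portable bit count for one byte (clears the lowest set bit per iteration). */
static inline int
popcount_byte(unsigned char b)
{
	int			n = 0;

	while (b)
	{
		b &= (unsigned char) (b - 1);
		n++;
	}
	return n;
}

/* Portable bit count for one 64-bit chunk. */
static inline int
popcount_word(uint64_t w)
{
	int			n = 0;

	while (w)
	{
		w &= w - 1;
		n++;
	}
	return n;
}

uint64_t
popcount_chunked(const char *buf, int bytes)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	count = 0;

	assert(bytes >= 0);

	/* Step 1: for larger inputs only, consume bytes until p is aligned. */
	if (bytes >= ALIGN_THRESHOLD)
	{
		while (bytes > 0 && ((uintptr_t) p % sizeof(uint64_t)) != 0)
		{
			count += popcount_byte(*p++);
			bytes--;
		}
	}

	/* Step 2: process whole chunks; memcpy keeps the unaligned case legal. */
	while (bytes >= (int) sizeof(uint64_t))
	{
		uint64_t	w;

		memcpy(&w, p, sizeof(w));
		count += popcount_word(w);
		p += sizeof(w);
		bytes -= (int) sizeof(w);
	}

	/* Step 3: process the last incomplete chunk byte by byte. */
	while (bytes > 0)
	{
		count += popcount_byte(*p++);
		bytes--;
	}

	return count;
}
```

In an SVE version, step 3 would presumably be handled with a predicate rather than a byte loop, but the control flow is the same.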
*Our setup:*
Machine: AWS EC2 c7g.8xlarge - 32 vCPUs, 64 GB RAM
OS : Ubuntu 22.04.5 LTS
GCC: 11.4
*Benchmark and Result:*
We used the popcount test module [0] recommended by the PostgreSQL
community for benchmarking and observed a speed-up of more than 4x for
larger buffers. Even for smaller inputs of 8 and 16 bytes, we did not
observe any performance degradation.
Looking forward to your thoughts!

I tested the patch; attached is the performance I see on a
`c7g.xlarge`. The perf data doesn't quite match what you observed
(especially for 256B). The chart compares the baseline, the AWS SVE
implementation (what I had implemented), and the Fujitsu SVE popcount
implementation. Can you confirm the command line you used for the
benchmark run?

I used the following command line:

`sudo su postgres -c "/usr/local/pgsql/bin/psql -c 'EXPLAIN ANALYZE
SELECT drive_popcount(100000, 16);'"`

------------------------------------------------------------------------
*From:* Nathan Bossart <nathandbossart@gmail.com>
*Sent:* Wednesday, December 4, 2024 21:37
*To:* Malladi, Rama <ramamalladi@hotmail.com>
*Cc:* Kirill Reshke <reshkekirill@gmail.com>; pgsql-hackers
<pgsql-hackers@postgresql.org>
*Subject:* Re: [PATCH] SVE popcount support
On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:

Thank you, Kirill, for the review and the feedback. Please find inline my
reply and an updated patch.

Thanks for the updated patch.  I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27).  The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code.  IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version of
Postgres, which will take some time for significant adoption, and 2) this
is brand new code, so we aren't introducing any regressions for current
users.
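
For readers unfamiliar with the approach, here is a minimal sketch of a per-function target attribute. It is not taken from the commits cited above. It uses x86 POPCNT as a stand-in because that compiles on common build machines, and the function names are hypothetical. On AArch64 with a recent GCC, the attribute string would presumably be an architecture string such as "arch=armv8-a+sve", but treat that as an assumption to check against the compiler documentation.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_TARGET_POPCNT 1

/*
 * Only this function is compiled with POPCNT enabled; the rest of the file
 * keeps the default instruction set, so no per-file CFLAGS are needed.
 */
__attribute__((target("popcnt")))
static uint64_t
popcount64_fast(uint64_t word)
{
	return (uint64_t) __builtin_popcountll(word);
}
#endif

/* Portable fallback: clear the lowest set bit until the word is zero. */
static uint64_t
popcount64_portable(uint64_t word)
{
	uint64_t	n = 0;

	while (word)
	{
		word &= word - 1;
		n++;
	}
	return n;
}

uint64_t
popcount64(uint64_t word)
{
#ifdef HAVE_TARGET_POPCNT
	/* The runtime check is still required before calling the fast version. */
	if (__builtin_cpu_supports("popcnt"))
		return popcount64_fast(word);
#endif
	return popcount64_portable(word);
}
```

The point is that the special instruction set is scoped to one function, so the build system no longer needs to track which object files get which flags.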

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

    #if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
    extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
    #endif

    #ifdef TRY_POPCNT_FAST
    ...

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?

+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+     if (check_sve_support())
+             pg_popcount_optimized = pg_popcount_sve;
+     else
+             pg_popcount_optimized = pg_popcount_slow;
+     return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?
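
As a rough illustration of the pattern under discussion (a function pointer that initially points at a "choose" routine and is overwritten on first use), here is a self-contained sketch. All `demo_` names are hypothetical, the CPU probe is a stub that always reports "not available", and the "fast" routine just forwards to the slow one. It is not the pg_bitutils.c code.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t demo_popcount_choose(const char *buf, int bytes);

/* Callers always go through this pointer; it starts at the chooser. */
uint64_t	(*demo_popcount) (const char *buf, int bytes) = demo_popcount_choose;

/* Portable byte-at-a-time implementation. */
static uint64_t
demo_popcount_slow(const char *buf, int bytes)
{
	uint64_t	count = 0;

	assert(bytes >= 0);
	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		while (b)
		{
			b &= (unsigned char) (b - 1);
			count++;
		}
	}
	return count;
}

/* Stand-in for an accelerated implementation (here it just forwards). */
static uint64_t
demo_popcount_fast(const char *buf, int bytes)
{
	return demo_popcount_slow(buf, bytes);
}

/* Stub probe; a real one would ask the OS or CPU about the feature. */
static int
demo_fast_path_available(void)
{
	return 0;
}

/* Runs once: picks an implementation, installs it, then forwards the call. */
static uint64_t
demo_popcount_choose(const char *buf, int bytes)
{
	if (demo_fast_path_available())
		demo_popcount = demo_popcount_fast;
	else
		demo_popcount = demo_popcount_slow;
	return demo_popcount(buf, bytes);
}
```

Keeping every architecture's choice in one chooser means the probe and the pointer assignment live in a single place, which seems to be the motivation for the request above.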

+// check if sve supported
+int check_sve_support(void)
+{
+     // Read ID_AA64PFR0_EL1 register
+     uint64_t pfr0;
+     __asm__ __volatile__(
+     "mrs %0, ID_AA64PFR0_EL1"
+     : "=r" (pfr0));
+
+     // SVE bits are 32-35
+     return (pfr0 >> 32) & 0xf;
+}

Is this based on some reference code from a manual that we could cite
here? Or better yet, is it possible to do this without inline assembly
(e.g., with another intrinsic function)?
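
One assembly-free option on Linux, offered as a sketch rather than as what the patch should do: the kernel reports SVE support through the ELF auxiliary vector, which can be read with `getauxval(AT_HWCAP)` and tested against `HWCAP_SVE`. The function name `sve_available` is made up for this example, and other operating systems would need a different mechanism.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>			/* getauxval, AT_HWCAP */
#include <asm/hwcap.h>			/* HWCAP_SVE */
#endif

/*
 * Returns true if the kernel advertises SVE to user space.  On anything
 * other than Linux/AArch64 this sketch simply reports false.
 */
bool
sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_SVE)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return false;
#endif
}
```

Unlike reading ID_AA64PFR0_EL1 directly, this only reports the feature when the kernel has actually enabled it for user space.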

+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan

Attachments:

PG-popcount-perf-comparison.png (image/png)
�PNG


IHDRlD��b�LiCCPICC ProfileH��W\�G�wdB����� "#��V�ATB ���-V�NDGE� �q�V��m��B�kq+��Z���w���?�{����������JsQM�$���`�������Z|�\�����/�oD�_sPj�s��-�H.��8](�A|�E ��@�B�|V�T�� ��A!�Q�LnQ�t�2h���	du>_�	�F�Y�L�C��'�P,��b���B�Alm��t�>;�+���i��h���#X�`!���\���3����*����U=K����I��0%V���$=2
bmP\,�Wbf�"$Ae���\�3��x�<7�7��
�aB�!���)�)m`��
q>/b=�kD���!�S�������q9C|7_6��R��"'������D�!}��0+>	b*���H�5 �����
��fq#�md�Xe,�D�`�>V�!����'�;�%�E���Y�!�\aO�A�a,X�H�I��'��"�b��"IB������������47z���+y3���q�s���T������x��xe6?4Z���.,��5��@������F��@&�!fxF����q����G����@�?�b��x�S� chL���B��@.��T��x��@F����
`��*��=?�~a8�	b�+�����@b1�D��
p�����8����=�)����p��I�3]\$�e���AC�I�:?��t��qo��q&np���+�B�;��2+�Q����+4dGq���1?����v�#*�\����#�����^��U���m�}���c���X�X�I�	k��+���{2���W��'���3_��2�r�:�����|��|����!�#gf��8��!b�$�q,g'g�����U��{a�}���
������c_���p����_86|��p��@!+Pq��!�'�}����3p^��P�A2�����\f�y`1(�`5X*�V����� h-�4�\W�
p��.������ $��0}��D�g��� �H8�$#iH&"A�<d	R��E*�mH-r9��F."��!�����G1T�A�P+t<�F9h�NE3��h!�]�V����=�^Bo���s����L1��q�(,��d��+���z�^�kX'�����8g�p��	���/�W��x
�������>�3�F0$�<	<�dB&a��PN�I8B8��.�k"��$Z����L�&�%� n&�#�"v�I$�>���M�"�I��b�F��I�UR�-Y�lBv&�S�r����|�|������I��xR�(B��*�J3�2�����E��zS������
j=����������Z��Xm�Z��~�j���k���s�S��+�w��R����F�Y��h)�|�JZ-����C�Q��!�X�Q���qU��B��s�����r�!�ez�&E�J����\�Y�yT��f�Ck�V�V��
��Z���I�V���B������h?f`s�!`,a�`�ct�u�ux:�:�:{u�u�t�u]tug�V���dbL+&���\�<���|?�hg�h��1�c��y�7V�OO�W��O���{}�~�~���F�����A��,�-�z����+[2���_QC;�X����
��������F���3�������O��0L|L�&e&'M~c��8�\V�,����4�Ta��������Y�Y��>���Ts�y�y�y�y���E��<�:�_,)�l�,�
��-�XY[%Y-�j������YZ�Y�����������nK�e���n��b����e�U�]�G�������;��y����w�A���P�P������X����b�����k������)�i���	�B'Mh�����������D����'6M|�b�"r��r��������������������=�}��-�;���}�������������g��A�?��r�v{uO��$��c�co3o��6�N�O���>����|�j�G~�~B��~�8��l��'���7\O�|��, 8�$�=P;0!�2�A�YPfP]P_�k���S!����5!�xF<����:?�l�zX\Xe��p�pYxs�.�^�e�$�1
D���E�����},�S�4vB����q���q��^�������`��HhM�'�&�&�I
HZ��9y����/%$���RH)�);S��NY?�+�5�8��T����^�f0-w���������iIi��>������t^���>W�A�\�',���EkE�2�3�ftgzg������*��s�����!�[���D����M���G�K�;*���H��0�1{F��^Z,���9s��>Y�l��O�7����6�����������g��5[2�m����s��0�+��:�t��y�s�o[�,H_���|���]���,�.�Y�s�S�����$-i^j�t����SW�Q,+���k��o�o���/��|���%���J�J�K?�����	�U|7�2ce�*�U[VWKV�\���f�������E�k(c�����~����.�[7P7(6tV�W4m���z�����U�U�6nZ���f���[���o5�Z������oo��PmU]����`���;�����v������vIvu�����u���m�{UZ�����������M����1�����;�v����������[�t�q��i�����������q4�hk�W��c��v���T�=�����'O�����=�y�q����g&��~6�l���s~���y����/�\��x�'�O���.5�������#�n�
��/7]����1���U����\��:����7:n&��}+�V�m���;�w^�R�����������_���A������t�<�0�a���Gw?"��k�S���g&�j���[z�z��6������z���}��������or_�K���?W����/��Z�����{��M�[��5�����Oz��������l?5�|o o`@���?0�<�d��.h�0���:Eu>,��L;����3�`q�~������[������M ��'�������SY��l�}����t�o��L����{�Tu��:�����9VeXIfMM*�i��D�l�DASCIIScreenshot5�U�iTXtXML:com.adobe.xmp<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 6.0.0">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:exif="http://ns.adobe.com/exif/1.0/">
         <exif:PixelYDimension>324</exif:PixelYDimension>
         <exif:PixelXDimension>620</exif:PixelXDimension>
         <exif:UserComment>Screenshot</exif:UserComment>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
W���@IDATx��	������Yo�EHd�� ��EH������*b�%�jx+$T�J�ji�F������XC^I4"�$B�=���{�3}�s��;sg��33�����l�9���9����9�����Y=��3l�]v����q�����kV=���k�-��������^�/" " " "W�F�Uu��W�5}J��������=�F��=$Z��@D@D@D@�@���#��l>mE@D@D@D@rL�Q�F�9l9���D@D@D@D �$�rMT��������@�	H�����\�`�5Q�'" " " 9& ��c�
ND@D@D@rM@�-�D����������[��*8�5	�\Ux" " " "�cl9��D@D@D@D �$�rMT�����@ee�}����y����U	l����M�65L�
U�@�[��4U�-� -Zd_|���e�&M�[�n���[U�i��O�nK�,�:����n-[����y���'�|bm�����{[�.]j��={�!���W���G��������7�f��y��m����Kd�)Z�t�=�����o}���j�,C���3f��
6���}�������:������'���{�����M0�W�B�OS%�%��II����dt~Aw����cC�e��q�@���n����U�V����
���];[�r�u���n����}�����y���|�rk���m���]��w�c?���i���|�Aw��k�I�����Q��b�0a��i�����
C������
�<\�`����l�����d����������lN�H�
=�[l��^"q������)+V�0D��/�h;���{��&����m��������^pb����r�1��^}��D����x����v�e�Y�^�\>�����?������u�Q	��I�@>�7Yj/^l�������'���?�����o�e	�{�x8
S�DX���qO��'
N@����GX��8!�>����`{���m����l�2����a����>D��\1���/��{�f�����r��������2:z�&[�nm-Z�0��'�|�M�6�>��3�K/�����������+�������#�<�Y���G����?��?n�X(�<~���Z��]�y�f�
fX0�x���6��������!C��g�q�����~�Y�|�.\��a$�����}��p��	��[������y<{���r�!��qcguz������?v���#�8�1#�{���v�e0`��������l�s�����
�?������}��AZG�w�+�d�����t���de��{����+����N0�aS�>��H1����L!O������bC}�2���s�9�pE R������O�^{�e�~�������+�o���t� ��d��9�>bm"=~�>*���Gy�����c������U�Vv�a��4]y"|�B����>�h�f�m\������O>i�
���z�{h�y�ab
�7NzqLa�+C��a~��������m�8t�P��|�����������;�.������<�r�e�}!�C~`����[omu�=>^mE Zt�	-�MI�e�����^d�����7������u
?�_~�ecX�s��n�A�L�LcL#�>�����^�:
=��8��h�����aJ:9����]�L�PQQ���m�������Nn{�
7���!<�7��Z��H4�4��t���^{�YL��y������	1�
���A:z�)td�t�@t�7����p����"l:S:C�|����;�6�Q�tht�8����+�t+�0b��NV�>,�M'���p�z�-���[]Z��E��nQ�|��q�-��/~�� A�|����t�P{�1�����
,&M����'N������t2����D����t��D�H����w��=2����?��q��^!$)oIN'O����D%�E��K$$��V3H��u�~�`�c�8�A��>��C��8Q�8�
`~)�*�x@�9���%�����y��
���e����p����s����L�b�'ON����N��E�.���EH�S���B��U��k�&=��G?ra`����K�����3�8�Yp~.��'���1���{�i��r�;G�����;��0�7��M',���
���g�iX'ptZ��U��;G���C\`!"\����_�:�SO=�u�x�O�!Y��q���|w"G�`x��g;��t�A V���t�X�F�r�
���Az�
0���#8�A"2�>�X����T����(�s�9����������=�\�*�da&�����a�C�3d���7t�XY�+Q�X�o����A]H�
���/tu�,�\"�aL��W/4(���G;�����~�����i��:F��k���;���Y�jd
�����:Km�B�y��!:����o:G}�����ZH��4�������<�u�X��CQo}�0|�p�0�3���c�$���s��X�WaG:���$��#�)��!��05��K.���o���]��4���_�����" [*:�V'��.��b'���K���\GG��4hP"��{X�p,Tt<�b�H��"��p4���
��k\#��5h������L"����~��_�t���u�[�� ������r���QK'��}���wx�����������Ex �p���P_���,��q��d�a��
�������/7���n�"�����]�����I��t��0*��s��}=_��?]6p�|p~���,�Ko��C�XSq\'��7��~OX��<$���<T��+��5�Q���n�Hb
�-TX,��[5}�jv�X<��?<G
>r���?��
�w��"[�t��o���w���������e0,��+".�9�������Q��~����p:t,��������I�	k�<O�����|����������������O]me��_�TiT��*@��Sp�1��
��R�a} ^�����V<���\'M����@��x������������d�4�AFA�����1����tTaF>���!�/�/DZ�q-��IK&.���
'F�e�c�q�����v��	��������
���2
=WT<�[,�X�}Xc�w�I'9��Ny��i�J?V�`�8����[$~���cI��o���g�m��-�#o6�]���I�?�B�@i��X1�.\oI�T,l�!s��M�7���C�"���[*:�V'��U�������Q��h�x������|��1\�u�����;&;���Pc��i�����[oY�:��&�hl������8~��g�����	�B�u	vN����XP�:�����c���
�t8t<p
:���.Y>��N_6�'���.�ut^8�1���*p���~�.\���K�/v��m���:C��&RW�aSW���S����g�9����/��8�Z1����w�P�G���� ��\���v���F~�,(B�y�c��H��h���;V�����zNYa9��?�/�������L0{��a���
zq���$<��l�Bz+[-��9��"K2ya��[
�j{����"���[*:�Vo����b���a,h4l~�����n;7/�w�^�i"���M�L��c�VDCW���d��bICL�9_���#D�0o�����d(1���B��3i����<:&}#>�t���w�W��?���"����"A���c~ s���[�U��>^��/Y3������	�X��
���y���)K?��u��w�q���p'�o:.\�~�P��dy@��*�p8�3|��!�Q�8'��~x�l2�'�q�����	3�#N�$��)���� m��ti���o����r'�t�;�x��_�tY�I��"R�p,�a���W�"Fzi����iX�x�`��?l�7���zx���#��0���BV����f��,�!n���_�#+��JZ�s��j{�
[�E ��[�K�IP oB0T@'��mO�����9F8����px��
/����S���(&C�!��a%@ ���wl����E]�:��q������~u%�����R����0)
�nD\�	"dp�&��3g�x�.��	��qL��s���
��@����������S�3����_#�Ta0Y����Q��|a�`�+|G��x!�����J��t�+b��"����KX�N��;���7�`K��TeK<�L��;�y�X������8��&���y��!��?X�(O��U�,�A@�1��3���y��`��bE%������p�>�~�������?8��N�sOd��?��p��Q�&�/3����;e��r��S��<��:1_��^��g�����]z�X�}�z�>=��I;�����4P����>mE LF��3�D 3�
gu�j+MU�Z���a���U��Z��V��UU7�U�O�U������X-*���C�����!T��?�O����>����R���	�Ym�����NWO���^����v	z�j��	��Ym5������W�%
�z���N�.�0�w��!n�%s\�q��/|L���7G&yHV����`�p�:��>�����
���!��fQ�r�j����u1�7E������8��3���&�`Z��ST�������O:�-�������SFaW�(��6��U�*w���Nq���~8�����W���-G��i��VU�������"���%5�%��5<�@� �N��-i+�5�t��K���a���gf�CxX�x���P�$����?���a+x�{R���~�+7�x��i�!,`
��^��ZQ�c=H6����d������Ct�c,�s��$�����,�4Q�[
���%?����>��a8�Q,#U]$/�z�[p?��p�������)*�p��x��'�Q�v��+>}X.�_�~����6����Kp���K�Q0�Q�������Q6������]�z��}���[]�t=�OE�de�.hXYV�OC���n�\:�-�uH%r��/a�GC_/�<�����A1��}���!O^!������cd!O��kh>�G���,��6��7V�_e��iP��E�I&n}>NND@D@D@D ^�<�x%I����Dt," " " 1# ��QrD@D@D@D L@�-LD�" " " "3l1+%GD@D@D@�$��Dt," " " 1# ��QrD@D@D@D L@�-LD�" " " "3l1+%GD@D@D@�b�i�i����~h]�vu���
�r" " " "P�bga���o����}�G}��x��r-�[D@D@D@��Y�����a�\��~�m[�f��JD@D@D@��@����3�8+��%K��3�H���.]j���������@1�������crX���&�{����km��e���Z��m����j�8���������s�c7��!�&M�������fq�9J������@^	�nH���?�1c�X������s�=7�@���������@,�D+++�b�6m�����#" " " y%��h,[^)(2�1�X�a�1/%MD@D@D@
B v�
BA��������@�	H���p�4��������������[�H�	6��9	����'" " " l�" " " "sl1/ %OD@D@D@$�TD@D@D@D �$�b^@J�������H���������@�	H�����<�`S���`�y)y" " " " ��: " " " 1' ��R�D@D@D@D@�Mu@D@D@D@bN@�-�����������������������[�H�	6��9	����'" " " l�" " " "sl1/ %OD@D@D@$�TD@D@D@D ��"}+V���^{�6l�`����u���F2���g3g�t�Z�je��q]" " " "PN
ba{���m����l�2��/Y���)Sl�����iS�W��N�������@(������w�W�\i'N���Jk���v�u�]m��AeT��������D(�`�Iy��������Z��m��'��	&��C��F���"�K�.5��D@D@D@D�t����K�����XU�6���?s���1c����#m�����9n#F����?�����G'E@D@D@D��	`��9���.\��n��v����bm��U�|�r���aR��Y3w�" " " "��>�i���G������/1�w�� �u�Y6}�t����Nr�Y���=���z��U��L�������M��P������)���/�X��E��`C�af,<������^��***�J4J�" " "P��
�5�N>���O��q�cK~2$�9?����n�Zb���VD@D@D A���r�5N�@q��Q�k��[�V�D@D@D@rK���^rs��w��6w���F��bea�%GD@D@D ������X��V�b�Al�D}���(�5�*�m�/42(�*�`�/" " " �0�Xc�C�a]+''�VN��������@��WC������E&�VdW��r!E���h��5�-LD�" " " 'k�6_-
�,lQTtND@D@D�`�r����X+����[��.B�." " "�����[r6�"" " "�'�!�r���\s����������@�����Ec��-���������������Y�-=N�%" " "�C��j[�/����D3�%�" " " Y`���jUUUn�u�%�[	��&" " q#��j�+
����������e,�]�-C�" " " �'�|5,klq���0d�O�-#\�," " "�	������	��~5��&��������@P���j�A��-;~�[D@D@D ����E@���[�t������@M��V�G��$�rER������@�@���6j�(�_-GuBs�rR������@9`��k~���rC@��pT(" " "P�4_���^���+(I����b�`�k�$" " %C@���[��l��O���g[�n�l����F����bHJ �~5�7)��^����e����S�N�����&M���+0���5�������t	�����C>|�K����m�����E�D@D@D@����5�4���`��^�~�M�2�.��"*�v�����������@n	L�<�N=�T�>��c\p�����%��`4��t����XU�6����~*++���n�����~��4x|�@D@D@D �@pT���5�Y���n�;v�UTT��'���������@A���jI ��t�,l}��]y����������3������&�������b'���G)��#������l�]t?\J������6��������c9�-^��(}A���j�+oY��W&J�����������w�"�`�6�$" " �O 8_�������P�\�hH4~e������@�`����X%�{�#�`�7:�(" " �I@��������a����;6o�<���u��Y�v��O�>��^{Y�.]���R," " %J@����`�z��O?mO<������	�-����6mj_}��}��������n�����{q"R�E@D@D��	����[�Y�8w�����sgk��u$�M�6��9s�_�~��uRD@D@D��	�@������!�`�j��^����[�f��������S��Mb�x*�R*" "Pz�bM�W+���J��l3�9a�k���;�����;!��?mE@D@D@�G��j�4�Q�F�}�+>9lX�����������p�m����GD)�"'��������CQ���^%J�{��m����+���X�f�l�-�,Z0J�����#/�H;��jl���@V�D�Y_�~�����t�R����
#9Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Malladi, Rama (#8)
2 attachment(s)
Re: [PATCH] SVE popcount support

Thank you for the suggestion; we have removed the `xsave` flag.

We have used the following command for benchmarking:
time ./build_fj/bin/psql pop_db -c "select drive_popcount(10000000, 16);"

We ran it 20 times and took the average to smooth out CPU fluctuations. The results observed on `m7g.4xlarge` are in the attached Excel file.

We have also updated the condition for buffer alignment so that the alignment step is skipped if the buffer is already aligned. This seems to have improved performance by a few milliseconds, because the input buffer provided by `drive_popcount` is already aligned. Please find the updated patch attached.

Thanks,
Chiranmoy

________________________________
From: Malladi, Rama <ramamalladi@hotmail.com>
Sent: Monday, December 9, 2024 10:06 PM
To: Susmitha, Devanga <Devanga.Susmitha@fujitsu.com>; Nathan Bossart <nathandbossart@gmail.com>
Cc: Kirill Reshke <reshkekirill@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>; Bhattacharya, Chiranmoy <Chiranmoy.Bhattacharya@fujitsu.com>; M A, Rajat <Rajat.Ma@fujitsu.com>; Hajela, Ragesh <Ragesh.Hajela@fujitsu.com>
Subject: Re: [PATCH] SVE popcount support

On 12/9/24 12:21 AM, Devanga.Susmitha@fujitsu.com wrote:
Hello,

We are sharing our patch for pg_popcount with SVE support as a contribution from our side in this thread. We hope this contribution will help in exploring and refining the popcount implementation further.
Our patch uses the existing infrastructure, i.e. the "choose_popcount_functions" method, to determine the correct popcount implementation based on the architecture, thereby requiring fewer code changes. The patch also includes implementations for popcount and popcount masked.
We can reference both solutions and work together toward achieving the most efficient and effective implementation for PostgreSQL.

Thanks for the patch; it looks good. I will review the full patch in the next couple of days. One observation: the patch adds `xsave` flags, which aren't needed.

`pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt + cflags_popcnt_arm, 'xsave': cflags_xsave}`

Algorithm Overview:
1. For larger inputs, align the buffers to avoid double loads. For smaller inputs alignment is not necessary and might even degrade the performance.
2. Process the aligned buffer chunk by chunk till the last incomplete chunk.
3. Process the last incomplete chunk.
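
The three steps above can be sketched in portable C. This is only an illustration under stated assumptions: `CHUNK` stands in for the SVE vector length (which real SVE code reads at run time), `ALIGN_THRESHOLD` mirrors the `4 * vec_len` cutoff that appears in the patch, and `popcount_chunked`, `popcount_bytes` and `popcount_word` are hypothetical names rather than functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values for illustration only; SVE reads the vector length at run time. */
#define CHUNK 16
#define ALIGN_THRESHOLD (4 * CHUNK)

/* Count set bits in one word by repeatedly clearing the lowest set bit. */
static uint64_t
popcount_word(uint64_t w)
{
	uint64_t	cnt = 0;

	for (; w != 0; w &= w - 1)
		cnt++;
	return cnt;
}

/* Byte-at-a-time helper used for the unaligned head and the tail. */
static uint64_t
popcount_bytes(const unsigned char *p, size_t n)
{
	uint64_t	cnt = 0;

	for (size_t i = 0; i < n; i++)
		cnt += popcount_word(p[i]);
	return cnt;
}

uint64_t
popcount_chunked(const char *buf, size_t bytes)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	cnt = 0;
	size_t		misalign;

	assert(buf != NULL || bytes == 0);

	/* Step 1: align only when the input is large enough to pay for it. */
	misalign = (uintptr_t) p % sizeof(uint64_t);
	if (misalign != 0 && bytes > ALIGN_THRESHOLD)
	{
		size_t		head = sizeof(uint64_t) - misalign;

		cnt += popcount_bytes(p, head);
		p += head;
		bytes -= head;
	}

	/* Step 2: process complete chunks. */
	while (bytes >= CHUNK)
	{
		uint64_t	words[CHUNK / sizeof(uint64_t)];

		memcpy(words, p, CHUNK);
		for (size_t i = 0; i < CHUNK / sizeof(uint64_t); i++)
			cnt += popcount_word(words[i]);
		p += CHUNK;
		bytes -= CHUNK;
	}

	/* Step 3: process the last incomplete chunk. */
	return cnt + popcount_bytes(p, bytes);
}
```

In an SVE version, steps 2 and 3 would presumably use full and partial predicates instead of the scalar loops shown here; the control flow is the part this sketch is meant to convey.
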
Our setup:
Machine: AWS EC2 c7g.8xlarge (32 vCPUs, 64 GB RAM)
OS: Ubuntu 22.04.5 LTS
GCC: 11.4

Benchmark and Result:
We used the popcount test module [0] recommended by the PostgreSQL community for benchmarking and observed a speed-up of more than 4x for larger buffers. Even for small inputs of 8 and 16 bytes, we did not observe any performance degradation.
Looking forward to your thoughts!

I tested the patch; attached is the performance I see on a `c7g.xlarge`. The perf data doesn't quite match what you observed (especially for 256B). The chart compares the baseline, the AWS SVE implementation (what I had implemented), and the Fujitsu SVE popcount implementation. Can you confirm the command line you used for the benchmark run?

I used the following command line:

`sudo su postgres -c "/usr/local/pgsql/bin/psql -c 'EXPLAIN ANALYZE SELECT drive_popcount(100000, 16);'"`

________________________________
From: Nathan Bossart <nathandbossart@gmail.com>
Sent: Wednesday, December 4, 2024 21:37
To: Malladi, Rama <ramamalladi@hotmail.com>
Cc: Kirill Reshke <reshkekirill@gmail.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: [PATCH] SVE popcount support

On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:

Thank you, Kirill, for the review and the feedback. Please find inline my
reply and an updated patch.

Thanks for the updated patch. I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

+# Check for ARMv8 SVE popcount intrinsics
+#
+CFLAGS_POPCNT=""
+PG_POPCNT_OBJS=""
+PGAC_SVE_POPCNT_INTRINSICS([])
+if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
+  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+fi
+if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS="pg_popcount_sve.o"
+  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+fi
+AC_SUBST(CFLAGS_POPCNT)
+AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27). The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code. IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version of
Postgres, which will take some time for significant adoption, and 2) this
is brand new code, so we aren't introducing any regressions for current
users.

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);

Could we combine this with the existing copy above this line? I'm thinking
of something like

#if defined(TRY_POPCNT_FAST) || defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (...)
#endif

#ifdef TRY_POPCNT_FAST
...

+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern int check_sve_support(void);
+#endif

Are we able to use SVE instructions for pg_popcount32(), pg_popcount64(),
and pg_popcount_masked(), too?

+static inline uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+     if (check_sve_support())
+             pg_popcount_optimized = pg_popcount_sve;
+     else
+             pg_popcount_optimized = pg_popcount_slow;
+     return pg_popcount_optimized(buf, bytes);
+}
+
+#endif        /* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */

Can we put this code in the existing choose_popcount_functions() function
in pg_bitutils.c?
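
For readers unfamiliar with the pattern under discussion, here is a minimal, self-contained sketch of "choose on first call" dispatch through a function pointer. All names (`popcount_optimized`, `popcount_choose`, `popcount_fast`, `popcount_portable`, `fast_path_available`) are hypothetical stand-ins, not the PostgreSQL symbols, and the capability probe is hard-wired to false so the example runs anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, int bytes);

static uint64_t popcount_choose(const char *buf, int bytes);

/* Starts out pointing at the chooser; replaced on the first call. */
popcount_fn popcount_optimized = popcount_choose;

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t
popcount_portable(const char *buf, int bytes)
{
	uint64_t	cnt = 0;

	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b != 0; b &= (unsigned char) (b - 1))
			cnt++;
	}
	return cnt;
}

/* Hypothetical stand-in for a vectorized kernel; it just reuses the fallback. */
static uint64_t
popcount_fast(const char *buf, int bytes)
{
	return popcount_portable(buf, bytes);
}

/* Hypothetical capability probe; a real one would query the CPU or the OS. */
static bool
fast_path_available(void)
{
	return false;
}

/* Picks an implementation once, installs it, then forwards the call. */
static uint64_t
popcount_choose(const char *buf, int bytes)
{
	assert(bytes >= 0);
	popcount_optimized = fast_path_available() ? popcount_fast : popcount_portable;
	return popcount_optimized(buf, bytes);
}
```

Keeping every such decision in one chooser means a new architecture only adds a branch there, rather than a second chooser with its own pointer initialization.
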

+// check if sve supported
+int check_sve_support(void)
+{
+     // Read ID_AA64PFR0_EL1 register
+     uint64_t pfr0;
+     __asm__ __volatile__(
+     "mrs %0, ID_AA64PFR0_EL1"
+     : "=r" (pfr0));
+
+     // SVE bits are 32-35
+     return (pfr0 >> 32) & 0xf;
+}

Is this based on some reference code from a manual that we could cite here?
Or better yet, is it possible to do this without inline assembly (e.g.,
with another intrinsic function)?
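
One way to avoid inline assembly on Linux is to ask the kernel through `getauxval()`. The sketch below assumes Linux on AArch64 with headers that define `HWCAP_SVE`; the function name `sve_available` is hypothetical, and on any other platform it conservatively reports that SVE is absent.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>		/* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>		/* HWCAP_SVE */
#endif

/* Returns true only when the kernel advertises SVE to user space. */
bool
sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_SVE)
	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
	return false;
#endif
}
```

Other operating systems would need their own probe, so a real implementation would likely keep a fallback that returns false when no supported mechanism is available.
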

+/*
+ * pg_popcount_sve
+ *              Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)

I think this function could benefit from some additional comments to
explain what is happening at each step.

--
nathan

Attachments:

FJ - AWS Comparison.xlsx (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
v2-0001-SVE-support-for-popcount-and-popcount-masked.patch (application/octet-stream)
From 4422558d6ce777bd46283ac772f5b59f67d0011f Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Tue, 10 Dec 2024 13:53:20 +0530
Subject: [PATCH v2] SVE support for popcount and popcount masked

---
 config/c-compiler.m4              |  41 +++++++++
 configure                         |  93 +++++++++++++++++++++
 configure.ac                      |  16 ++++
 meson.build                       |  31 +++++++
 src/Makefile.global.in            |   4 +
 src/include/pg_config.h.in        |   3 +
 src/include/port/pg_bitutils.h    |  14 ++++
 src/makefiles/meson.build         |   3 +-
 src/port/Makefile                 |  11 +++
 src/port/meson.build              |   4 +-
 src/port/pg_bitutils.c            |  10 ++-
 src/port/pg_popcount_sve.c        | 134 ++++++++++++++++++++++++++++++
 src/port/pg_popcount_sve_choose.c |  32 +++++++
 13 files changed, 393 insertions(+), 3 deletions(-)
 create mode 100644 src/port/pg_popcount_sve.c
 create mode 100644 src/port/pg_popcount_sve_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index e112fd45d4..eabe68a773 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,44 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_POPCNT_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE popcount instructions using the
+# svdup_u64, svptrue_b64, svcnt_z, svcnt_x, svadd_x, svaddv, and svwhilelt_b8
+# intrinsic functions.
+#
+# Optional compiler flags can be passed as arguments (e.g., -march=armv8-a+sve).
+AC_DEFUN([PGAC_ARM_SVE_POPCNT_INTRINSICS],
+[
+  AC_CACHE_CHECK([for svdup_u64 and other intrinsics with CFLAGS=$1],
+                 [pgac_cv_arm_sve_popcnt_intrinsics],
+  [
+    pgac_save_CFLAGS=$CFLAGS
+    CFLAGS="$pgac_save_CFLAGS $1"
+
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>],
+    [svbool_t predicate = svptrue_b64();
+     svuint64_t segment = svdup_u64(0), accum = svdup_u64(0);
+     const char *buf = NULL; /* Simulating a buffer pointer */
+     uint32_t num_vals_segment = svlen_u64(segment);
+
+     /* Using intrinsics as per the code */
+     predicate = svwhilelt_b8(0, 128);
+     segment = svld1(predicate, (const uint64_t *)buf);
+     accum = svadd_x(predicate, accum, svcnt_x(predicate, segment));
+     uint64_t popcnt = svaddv(predicate, accum);
+
+     /* Return computed value, to prevent the above being optimized away */
+     return popcnt;])],
+    [pgac_cv_arm_sve_popcnt_intrinsics=yes],
+    [pgac_cv_arm_sve_popcnt_intrinsics=no])
+
+    CFLAGS="$pgac_save_CFLAGS"
+  ])
+
+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    CFLAGS_POPCNT_ARM="$1"
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi
+])
diff --git a/configure b/configure
index 518c33b73a..a3e41459d5 100755
--- a/configure
+++ b/configure
@@ -647,6 +647,8 @@ MSGFMT_FLAGS
 MSGFMT
 PG_CRC32C_OBJS
 CFLAGS_CRC
+PG_POPCNT_OBJS_ARM
+CFLAGS_POPCNT_ARM
 LIBOBJS
 OPENSSL
 ZSTD
@@ -17159,6 +17161,97 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+CFLAGS_POPCNT_ARM=""
+PG_POPCNT_OBJS_ARM=""
+
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_u64 with CFLAGS=" >&5
+$as_echo_n "checking for svcnt_u64 with CFLAGS=... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics_+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+  CFLAGS="$pgac_save_CFLAGS "
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+int
+main ()
+{
+    svbool_t predicate = svptrue_b64();
+    svuint64_t segment, accum = svdup_u64(0);
+    uint64_t numVals = svlen_u64(segment);
+
+    svuint64_t counts = svcnt_u64_z(predicate, segment);
+    accum = svadd_u64_m(predicate, accum, counts);
+    return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics_=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics_" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics_" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics_" = x"yes"; then
+  CFLAGS_POPCNT_ARM=""
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_arm_sve_popcnt_intrinsics" != x"yes"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_u64 with CFLAGS=-march=armv8-a+sve" >&5
+$as_echo_n "checking for svcnt_u64 with CFLAGS=-march=armv8-a+sve... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+  CFLAGS="$pgac_save_CFLAGS -march=armv8-a+sve"
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+int
+main ()
+{
+    svbool_t predicate = svptrue_b64();
+    svuint64_t segment, accum = svdup_u64(0);
+    uint64_t numVals = svlen_u64(segment);
+
+    svuint64_t counts = svcnt_u64_z(predicate, segment);
+    accum = svadd_u64_m(predicate, accum, counts);
+    return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics__march_armv8_a_sve" = x"yes"; then
+  CFLAGS_POPCNT_ARM="-march=armv8-a+sve"
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+fi
+if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+  PG_POPCNT_OBJS_ARM="pg_popcount_sve.o pg_popcount_sve_choose.o"
+
+  $as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index 247ae97fa4..1ea314190b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,22 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM popcount intrinsics
+CFLAGS_POPCNT_ARM=""
+PG_POPCNT_OBJS_ARM=""
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_POPCNT_INTRINSICS([])
+  if test x"$pgac_arm_sve_popcnt_intrinsics" != x"yes"; then
+    PGAC_ARM_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
+  fi
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+    PG_POPCNT_OBJS_ARM="pg_popcount_sve.o pg_popcount_sve_choose.o"
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM popcount instructions.])
+  fi
+fi
+AC_SUBST(CFLAGS_POPCNT_ARM)
+AC_SUBST(PG_POPCNT_OBJS_ARM)
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index e5ce437a5c..6c936f2f2b 100644
--- a/meson.build
+++ b/meson.build
@@ -2191,6 +2191,37 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of ARM SVE popcount intrinsics.
+###############################################################
+
+cflags_popcnt_arm = []
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+int main(void)
+{
+    const svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
+    svuint64_t popcnt = svcntb(val);
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE popcount without -march=armv8-a+sve',
+        args: test_c_args + ['-DSVINT64=@0@'.format(cdata.get('SV_INT64_TYPE'))])
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  elif cc.links(prog, name: 'ARM SVE popcount with -march=armv8-a+sve',
+        args: test_c_args + ['-DSVINT64=@0@'.format(cdata.get('SV_INT64_TYPE'))] + ['-march=armv8-a+sve'])
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+    cflags_popcnt_arm += ['-march=armv8-a+sve']
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index eac3d00121..2c32dfab5e 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -262,6 +262,7 @@ CFLAGS_SL_MODULE = @CFLAGS_SL_MODULE@
 CXXFLAGS_SL_MODULE = @CXXFLAGS_SL_MODULE@
 CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
 CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
+CFLAGS_POPCNT_ARM = @CFLAGS_POPCNT_ARM@
 CFLAGS_CRC = @CFLAGS_CRC@
 PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
 PERMIT_MISSING_VARIABLE_DECLARATIONS = @PERMIT_MISSING_VARIABLE_DECLARATIONS@
@@ -770,6 +771,9 @@ LIBOBJS = @LIBOBJS@
 # files needed for the chosen CRC-32C implementation
 PG_CRC32C_OBJS = @PG_CRC32C_OBJS@
 
+# files needed for the chosen popcount implementation
+PG_POPCNT_OBJS_ARM = @PG_POPCNT_OBJS_ARM@
+
 LIBS := -lpgcommon -lpgport $(LIBS)
 
 # to make ws2_32.lib the last library
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..29c32bbbbe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a3cad46afe..57ebfddb7d 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,14 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On AArch64 builds, try using SVE popcount instructions, but only if
+ * we can verify that the CPU supports it via a runtime check.
+ */
+#if defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+#define TRY_POPCNT_FAST 1
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
@@ -317,6 +325,12 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_sve_available(void);
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index aba7411a1b..c0207426c2 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -102,6 +102,7 @@ pgxs_kv = {
     ' '.join(cflags_no_missing_var_decls),
 
   'CFLAGS_CRC': ' '.join(cflags_crc),
+  'CFLAGS_POPCNT_ARM': ' '.join(cflags_popcnt_arm),
   'CFLAGS_UNROLL_LOOPS': ' '.join(unroll_loops_cflags),
   'CFLAGS_VECTORIZE': ' '.join(vectorize_cflags),
 
@@ -179,7 +180,7 @@ pgxs_empty = [
   'WANTED_LANGUAGES',
 
   # Not needed because we don't build the server / PLs with the generated makefile
-  'LIBOBJS', 'PG_CRC32C_OBJS', 'TAS',
+  'LIBOBJS', 'PG_CRC32C_OBJS', 'PG_POPCNT_OBJS', 'PG_POPCNT_OBJS_ARM', 'TAS',
   'PG_TEST_EXTRA',
   'DTRACEFLAGS', # only server has dtrace probes
 
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..2e04ea4d5a 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -38,6 +38,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS = \
 	$(LIBOBJS) \
 	$(PG_CRC32C_OBJS) \
+	$(PG_POPCNT_OBJS_ARM) \
 	bsearch_arg.o \
 	chklocale.o \
 	inet_net_ntop.o \
@@ -87,6 +88,16 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
 pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
 
+# all versions of pg_popcount_sve.o need CFLAGS_POPCNT_ARM
+pg_popcount_sve.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_shlib.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_srv.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+
+# all versions of pg_popcount_sve_choose.o need CFLAGS_POPCNT_ARM
+pg_popcount_sve_choose.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_choose_shlib.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+pg_popcount_sve_choose_srv.o: CFLAGS+=$(CFLAGS_POPCNT_ARM)
+
 #
 # Shared library versions of object files
 #
diff --git a/src/port/meson.build b/src/port/meson.build
index c5bceed9cd..21d686a26e 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -91,6 +91,8 @@ replace_funcs_pos = [
   ['pg_crc32c_armv8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 'crc'],
   ['pg_crc32c_armv8_choose', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
   ['pg_crc32c_sb8', 'USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK'],
+  ['pg_popcount_sve', 'USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 'popcnt'],
+  ['pg_popcount_sve_choose', 'USE_SVE_POPCNT_WITH_RUNTIME_CHECK'],
 
   # loongarch
   ['pg_crc32c_loongarch', 'USE_LOONGARCH_CRC32C'],
@@ -99,7 +101,7 @@ replace_funcs_pos = [
   ['pg_crc32c_sb8', 'USE_SLICING_BY_8_CRC32C'],
 ]
 
-pgport_cflags = {'crc': cflags_crc}
+pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt + cflags_popcnt_arm}
 pgport_sources_cflags = {'crc': []}
 
 foreach f : replace_funcs_neg
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index c8399981ee..6b2e6b3794 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -135,7 +135,9 @@ pg_popcount_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
 
-#if defined(HAVE__GET_CPUID)
+#if defined(__aarch64__)
+	return false;						/* cpuid not available in __aarch64__ */
+#elif defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
@@ -176,6 +178,12 @@ choose_popcount_functions(void)
 		pg_popcount_optimized = pg_popcount_avx512;
 		pg_popcount_masked_optimized = pg_popcount_masked_avx512;
 	}
+#elif USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
 #endif
 }
 
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 0000000000..c2a3a4cba0
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,134 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  Holds the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#include <arm_sve.h>
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+/*
+ * pg_popcount_sve
+ *		Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svbool_t    pred;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64      popcnt = 0;
+	const char *aligned = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf);
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * Therefore, we align the buffers only when the input size is sufficiently large.
+	 */
+	if (aligned != buf && bytes > 4 * vec_len)
+	{
+		pre_align = aligned + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		popcnt = svaddv(pred, svcnt_z(pred, svld1(pred, (const uint8 *) buf)));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svld1(pred, (const uint64 *) (buf + i));
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svld1(pred, (const uint64 *) (buf + i + vec_len));
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));	/* reduce the accumulators */
+
+	/* Process the last incomplete vector  */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		popcnt += svaddv(pred, svcnt_z(pred, svld1(pred, (const uint8 *) (buf + i))));
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns the number of 1-bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	svbool_t	pred;
+	svuint8_t   vec8;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+	const char *aligned = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf);
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * Therefore, we align the buffers only when the input size is sufficiently large.
+	 */
+	if (aligned != buf && bytes > 4 * vec_len)
+	{
+		pre_align = aligned + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		vec8 = svand_n_u8_m(pred, svld1(pred, (const uint8 *) buf), mask);  /* load and mask */
+		popcnt = svaddv(pred, svcnt_z(pred, vec8));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_n_u64_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_n_u64_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));	/* reduce the accumulators */
+
+	/* Process the last incomplete vectors */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		vec8 = svand_n_u8_m(pred, svld1(pred, (const uint8 *) (buf + i)), mask);
+		popcnt += svaddv(pred, svcnt_z(pred, vec8));
+	}
+
+	return popcnt;
+}
+
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
diff --git a/src/port/pg_popcount_sve_choose.c b/src/port/pg_popcount_sve_choose.c
new file mode 100644
index 0000000000..5f4e164f9c
--- /dev/null
+++ b/src/port/pg_popcount_sve_choose.c
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve_choose.c
+ *    Test whether we can use the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    src/port/pg_popcount_sve_choose.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#include <asm/hwcap.h>
+#include <sys/auxv.h>
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+	unsigned long hwcap = getauxval(AT_HWCAP); /* get the HWCAP flags */
+	return (hwcap & HWCAP_SVE) != 0; /* return true if SVE is supported */
+}
+
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.34.1

#10Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#9)
1 attachment(s)
Re: [PATCH] SVE popcount support

Hi all,

Here is the updated patch using pg_attribute_target("arch=armv8-a+sve") to compile the arch-specific function instead of using compiler flags.
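
As a rough illustration of the approach, here is a minimal sketch of building one function for a different target through a function attribute instead of per-file compiler flags. All `sketch_*` names are hypothetical and not taken from the patch. The "special" variant uses a plain loop as a stand-in where real code would use SVE intrinsics, and it is only safe to select it after a runtime check has confirmed CPU support.

```c
#include <stdint.h>

/*
 * Hypothetical helper macro for this sketch: expands to a target attribute
 * when the compiler supports one, and to nothing otherwise.  The nested
 * #if avoids a preprocessor error on compilers lacking __has_attribute.
 */
#if defined(__has_attribute)
#if __has_attribute(target)
#define sketch_attribute_target(...) __attribute__((target(__VA_ARGS__)))
#endif
#endif
#ifndef sketch_attribute_target
#define sketch_attribute_target(...)
#endif

/* Portable fallback, compiled with the file's default flags. */
static uint64_t
sketch_popcount_portable(const unsigned char *buf, int bytes)
{
    uint64_t count = 0;

    for (int i = 0; i < bytes; i++)
    {
        unsigned char b = buf[i];

        /* clear the lowest set bit until none remain */
        while (b)
        {
            b &= (unsigned char) (b - 1);
            count++;
        }
    }
    return count;
}

/*
 * Only this function is compiled for the extended instruction set, so no
 * extra CFLAGS are needed for the file.  A real version would use SVE
 * intrinsics here; the loop is just a placeholder.
 */
#if defined(__aarch64__)
sketch_attribute_target("arch=armv8-a+sve")
#endif
static uint64_t
sketch_popcount_special(const unsigned char *buf, int bytes)
{
    uint64_t count = 0;

    for (int i = 0; i < bytes; i++)
    {
        unsigned char b = buf[i];

        while (b)
        {
            b &= (unsigned char) (b - 1);
            count++;
        }
    }
    return count;
}

/* Callers go through this pointer; it defaults to the portable version. */
uint64_t (*sketch_popcount) (const unsigned char *buf, int bytes) =
    sketch_popcount_portable;

/* have_sve is assumed to come from a runtime CPU feature check. */
void
sketch_choose_popcount(int have_sve)
{
    sketch_popcount = have_sve ? sketch_popcount_special
                               : sketch_popcount_portable;
}
```

Because the attribute lets the compiler emit instructions the running CPU may not have, the specially compiled function is reached only through the pointer, after the check.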

---
Chiranmoy

Attachments:

v3-0001-SVE-support-for-popcount-and-popcount-masked.patchapplication/octet-stream; name=v3-0001-SVE-support-for-popcount-and-popcount-masked.patchDownload
From 9ce09b6abaf8e1241e536edfd863ad6dc1f85929 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Thu, 9 Jan 2025 14:13:22 +0530
Subject: [PATCH v3] SVE support for popcount and popcount masked

---
 config/c-compiler.m4           |  42 ++++++++++
 configure                      |  50 +++++++++++
 configure.ac                   |   9 ++
 meson.build                    |  28 +++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  14 ++++
 src/port/Makefile              |   1 +
 src/port/meson.build           |   1 +
 src/port/pg_bitutils.c         |  10 ++-
 src/port/pg_popcount_sve.c     | 149 +++++++++++++++++++++++++++++++++
 10 files changed, 306 insertions(+), 1 deletion(-)
 create mode 100644 src/port/pg_popcount_sve.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..6c86811e8c 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,45 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_POPCNT_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE popcount instructions using the
+# svdup_u64, svptrue_b64, svcnt_z, svcnt_x, svadd_x, svaddv, and svwhilelt_b8
+# intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_POPCNT_INTRINSICS],
+[
+  AC_CACHE_CHECK([for svdup_u64 and other intrinsics with CFLAGS=$1],
+                 [pgac_cv_arm_sve_popcnt_intrinsics],
+  [
+    pgac_save_CFLAGS=$CFLAGS
+    CFLAGS="$pgac_save_CFLAGS $1"
+
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>],
+    #if defined(__has_attribute) && __has_attribute (target)
+      __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    [svbool_t predicate = svptrue_b64();
+     svuint64_t segment = svdup_u64(0), accum = svdup_u64(0);
+     const char *buf = NULL;
+     uint32_t num_vals_segment = svlen_u64(segment);
+
+     predicate = svwhilelt_b8(0, 128);
+     segment = svld1(predicate, (const uint64_t *)buf);
+     accum = svadd_x(predicate, accum, svcnt_x(predicate, segment));
+     uint64_t popcnt = svaddv(predicate, accum);
+
+     /* Return computed value, to prevent the above being optimized away */
+     return popcnt;])],
+    [pgac_cv_arm_sve_popcnt_intrinsics=yes],
+    [pgac_cv_arm_sve_popcnt_intrinsics=no])
+
+    CFLAGS="$pgac_save_CFLAGS"
+  ])
+
+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi
+])
diff --git a/configure b/configure
index a0b5e10ca3..e8ac7b299f 100755
--- a/configure
+++ b/configure
@@ -17159,6 +17159,56 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE intrinsic svcnt_u64" >&5
+$as_echo_n "checking for SVE intrinsic svcnt_u64... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+  CFLAGS="$pgac_save_CFLAGS "
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+#if defined(__has_attribute) && __has_attribute(target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int
+main ()
+{
+    svbool_t predicate = svptrue_b64();
+    svuint64_t segment, accum = svdup_u64(0);
+    uint64_t numVals = svlen_u64(segment);
+
+    svuint64_t counts = svcnt_u64_z(predicate, segment);
+    accum = svadd_u64_m(predicate, accum, counts);
+    return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+  $as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index d713360f34..ba069ebb29 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM popcount instructions.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index cfd654d291..da04f8813d 100644
--- a/meson.build
+++ b/meson.build
@@ -2194,6 +2194,34 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of ARM SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    const svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
+    svuint64_t popcnt = svcntb(val);
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE pop count', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..29c32bbbbe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index f8d6fb50b6..3a09bb5d16 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,14 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On AArch64, try using SVE popcount instructions, but only if
+ * we can verify that the CPU supports it via a runtime check.
+ */
+#if defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+#define TRY_POPCNT_FAST 1
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
@@ -317,6 +325,12 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_sve_available(void);
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..61a8bcec15 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_popcount_sve.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d4..4a3429c21a 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_popcount_sve.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693..df7cf429c5 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -135,7 +135,9 @@ pg_popcount_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
 
-#if defined(HAVE__GET_CPUID)
+#if defined(__aarch64__)
+	return false;						/* cpuid not available in __aarch64__ */
+#elif defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
@@ -176,6 +178,12 @@ choose_popcount_functions(void)
 		pg_popcount_optimized = pg_popcount_avx512;
 		pg_popcount_masked_optimized = pg_popcount_masked_avx512;
 	}
+#elif USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
 #endif
 }
 
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 0000000000..eea3790c32
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,149 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  Holds the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+#include <sys/auxv.h>
+#include <arm_sve.h>
+
+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+	return getauxval(AT_HWCAP) & HWCAP_SVE;
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns the number of 1-bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svbool_t    pred;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64      popcnt = 0;
+	const char *aligned = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf);
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * Therefore, the buffers only when the input size is sufficiently large.
+	 */
+	if (aligned != buf && bytes > 4 * vec_len)
+	{
+		pre_align = aligned + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		popcnt = svaddv(pred, svcnt_z(pred, svld1(pred, (const uint8 *) buf)));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svld1(pred, (const uint64 *) (buf + i));
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svld1(pred, (const uint64 *) (buf + i + vec_len));
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vector  */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		popcnt += svaddv(pred, svcnt_z(pred, svld1(pred, (const uint8 *) (buf + i))));
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns the number of 1-bits in buf after applying the mask
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	svbool_t	pred;
+	svuint8_t   vec8;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+	const char *aligned = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf);
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * Therefore, the buffers only when the input size is sufficiently large.
+	 */
+	if (aligned != buf && bytes > 4 * vec_len)
+	{
+		pre_align = aligned + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		vec8 = svand_n_u8_m(pred, svld1(pred, (const uint8 *) buf), mask);  /* load and mask */
+		popcnt = svaddv(pred, svcnt_z(pred, vec8));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_n_u64_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_n_u64_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vectors */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		vec8 = svand_n_u8_m(pred, svld1(pred, (const uint8 *) (buf + i)), mask);
+		popcnt += svaddv(pred, svcnt_z(pred, vec8));
+	}
+
+	return popcnt;
+}
+
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.34.1

#11Malladi, Rama
ramamalladi@hotmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#10)
Re: [PATCH] SVE popcount support

Here is the updated patch using
pg_attribute_target("arch=armv8-a+sve") to compile the arch-specific
function instead of using compiler flags.

---

This looks good. Thanks Chiranmoy and team. Can you address any other
feedback from Nathan or others here? Then we can pursue further reviews
and merging of the patch.

#12Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Malladi, Rama (#11)
Re: [PATCH] SVE popcount support

This looks good. Thanks Chiranmoy and team. Can you address any other feedback from Nathan or others here? Then we can pursue further reviews and merging of the patch.

Thank you for the review.
If there is no further feedback from the community, may we submit the patch for the next commit fest?

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#12)
Re: [PATCH] SVE popcount support

On Wed, Jan 22, 2025 at 11:04:22AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

If there is no further feedback from the community, may we submit the
patch for the next commit fest?

I would encourage you to create a commitfest entry so that it is picked up
by our automated patch testing tools.

https://commitfest.postgresql.org/

--
nathan

#14Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#13)
Re: [PATCH] SVE popcount support

The meson configure check seems to fail on my machine:

error: too many arguments to function call, expected 0, have 1
10 | svuint64_t popcnt = svcntb(val);
| ~~~~~~ ^~~

error: returning '__SVInt64_t' from a function with incompatible result type 'int'
12 | return popcnt == 0;
| ^~~~~~~~~~~

The autoconf version seems to work okay, though.

+ pgac_save_CFLAGS=$CFLAGS
+ CFLAGS="$pgac_save_CFLAGS $1"

I don't see any extra compiler flag tests used, so we no longer need this,
right?

+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi

I'm curious why this doesn't use Ac_cachevar like the examples above it
(e.g., PGAC_XSAVE_INTRINSICS).

+  prog = '''
+#include <arm_sve.h>
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    const svuint64_t val = svdup_u64(0xFFFFFFFFFFFFFFFF);
+    svuint64_t popcnt = svcntb(val);
+    /* return computed value, to prevent the above being optimized away */
+    return popcnt == 0;
+}
+'''

This test looks quite different than the autoconf one. Why is that? I
would expect them to be the same. And I think ideally the test would check
that all the intrinsics functions we need are available.

+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+    return getauxval(AT_HWCAP) & HWCAP_SVE;
+}

pg_crc32c_armv8_available() (in pg_crc32c_armv8_choose.c) looks quite a bit
more complicated than this. Are we missing something here?

+    /*
+     * For smaller inputs, aligning the buffer degrades the performance.
+     * Therefore, the buffers only when the input size is sufficiently large.
+     */

Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs? I'm also curious what level of performance degradation you
were seeing.

+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK

nitpick: The USE_SVE_POPCNT_WITH_RUNTIME_CHECK check can probably go above
the #include for pg_bitutils.h (but below the one for c.h).

--
nathan

#15Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#14)
1 attachment(s)
Re: [PATCH] SVE popcount support

The meson configure check seems to fail on my machine:

This test looks quite different than the autoconf one. Why is that? I
would expect them to be the same. And I think ideally the test would check
that all the intrinsics functions we need are available.

Fixed, both meson and autoconf have the same test program with all the intrinsics.
Meson should work now.

+ pgac_save_CFLAGS=$CFLAGS
+ CFLAGS="$pgac_save_CFLAGS $1"

I don't see any extra compiler flag tests used, so we no longer need this,
right?

True, removed it.

+  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+    pgac_arm_sve_popcnt_intrinsics=yes
+  fi

I'm curious why this doesn't use Ac_cachevar like the examples above it
(e.g., PGAC_XSAVE_INTRINSICS).

Implemented using Ac_cachevar similar to PGAC_XSAVE_INTRINSICS.

+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK

nitpick: The USE_SVE_POPCNT_WITH_RUNTIME_CHECK check can probably go above
the #include for pg_bitutils.h (but below the one for c.h).

Done.

pg_crc32c_armv8_available() (in pg_crc32c_armv8_choose.c) looks quite a bit
more complicated than this. Are we missing something here?

SVE is only available on AArch64, so we don't need to worry about AArch32. The latest patch
includes runtime checks for Linux and FreeBSD. On all other operating systems the function
returns false, because we have no way to verify SVE support there.
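
For reference, a per-OS check along these lines could look like the sketch below. It is not the code from the attached patch. The function name `sketch_sve_available` is hypothetical, and the FreeBSD branch assumes that `<sys/auxv.h>` provides `elf_aux_info()` together with the `AT_HWCAP` and `HWCAP_SVE` definitions. Anything that cannot be verified falls through to false.

```c
#include <stdbool.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>       /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>      /* HWCAP_SVE */
#elif defined(__aarch64__) && defined(__FreeBSD__)
#include <sys/auxv.h>       /* elf_aux_info(); assumed to pull in HWCAP_SVE */
#endif

/*
 * Returns true only when the OS reports SVE support for this CPU.
 * On platforms without a known way to ask, conservatively returns false.
 */
bool
sketch_sve_available(void)
{
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_SVE)
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#elif defined(__aarch64__) && defined(__FreeBSD__) && defined(HWCAP_SVE)
    unsigned long hwcap = 0;

    /* elf_aux_info() returns 0 on success */
    if (elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) != 0)
        return false;
    return (hwcap & HWCAP_SVE) != 0;
#else
    return false;           /* unable to verify, so do not use SVE */
#endif
}
```

Comparing the masked value against zero, rather than returning it directly, avoids relying on the implicit conversion from unsigned long to bool.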

+    /*
+     * For smaller inputs, aligning the buffer degrades the performance.
+     * Therefore, the buffers only when the input size is sufficiently large.
+     */

Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs? I'm also curious what level of performance degradation you
were seeing.

Here is a comparison of all three cases. Alignment is marginally better for inputs
of 1024B and above, but the difference is small. Unaligned performs better for smaller inputs.

Aligned After 128B => the current implementation, "if (aligned != buf && bytes > 4 * vec_len)"
Always Aligned => the condition "bytes > 4 * vec_len" is removed
Unaligned => the whole if block is removed

   buf | Always Aligned | Aligned After 128B | Unaligned
-------+----------------+--------------------+-----------
    16 |         37.851 |             38.203 |    34.971
    32 |         37.859 |             38.187 |    34.972
    64 |         37.611 |             37.405 |    34.121
   128 |         45.357 |             45.897 |    41.890
   256 |         62.440 |             63.454 |    58.666
   512 |        100.120 |            102.767 |    99.861
  1024 |        159.574 |            158.594 |   164.975
  2048 |        282.354 |            281.198 |   283.937
  4096 |        532.038 |            531.068 |   533.699
  8192 |       1038.973 |           1038.083 |  1039.206
 16384 |       2028.604 |           2025.843 |  2033.940
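
To make the three variants concrete, here is a portable sketch of the structure being compared: an optional byte-wise prologue up to the next 8-byte boundary, a main loop over whole words, and a byte-wise tail. It uses scalar code as a stand-in for the SVE intrinsics. The `sketch_*` names and the `threshold` parameter are hypothetical, with `threshold` playing the role of `4 * vec_len` in the patch (128 in the table's label, which would correspond to 32-byte vectors).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count set bits in one byte by clearing the lowest set bit repeatedly. */
static uint64_t
sketch_popcount_byte(unsigned char b)
{
    uint64_t n = 0;

    while (b)
    {
        b &= (unsigned char) (b - 1);
        n++;
    }
    return n;
}

/* Same technique for a 64-bit word. */
static uint64_t
sketch_popcount_word(uint64_t w)
{
    uint64_t n = 0;

    while (w)
    {
        w &= w - 1;
        n++;
    }
    return n;
}

uint64_t
sketch_popcount_aligned(const char *buf, int bytes, int threshold)
{
    uint64_t popcnt = 0;
    int      i = 0;
    size_t   misalign = (size_t) ((uintptr_t) buf % sizeof(uint64_t));

    /*
     * Prologue: consume bytes up to the next 8-byte boundary, but only
     * when the input is large enough for the extra step to pay off.
     */
    if (misalign != 0 && bytes > threshold)
    {
        int pre_align = (int) (sizeof(uint64_t) - misalign);

        if (pre_align > bytes)
            pre_align = bytes;
        for (; i < pre_align; i++)
            popcnt += sketch_popcount_byte((unsigned char) buf[i]);
    }

    /* Main loop: whole words; memcpy keeps unaligned reads well-defined. */
    for (; i + (int) sizeof(uint64_t) <= bytes; i += (int) sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf + i, sizeof(word));
        popcnt += sketch_popcount_word(word);
    }

    /* Tail: whatever is left after the last whole word. */
    for (; i < bytes; i++)
        popcnt += sketch_popcount_byte((unsigned char) buf[i]);

    return popcnt;
}
```

In this sketch, a `threshold` of 0 corresponds to "Always Aligned", 128 to "Aligned After 128B", and a value larger than any input to "Unaligned". The result is identical in all three cases; only the access pattern differs.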

---
Chiranmoy

Attachments:

v4-0001-SVE-support-for-popcount-and-popcount-masked.patchapplication/octet-stream; name=v4-0001-SVE-support-for-popcount-and-popcount-masked.patchDownload
From 952412a0be1d9b39f12c86f3882cbdac04e9602a Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Tue, 4 Feb 2025 14:03:28 +0530
Subject: [PATCH v4] SVE support for popcount and popcount masked

---
 config/c-compiler.m4           |  36 ++++++++
 configure                      |  56 ++++++++++++
 configure.ac                   |   9 ++
 meson.build                    |  33 +++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  14 +++
 src/port/Makefile              |   1 +
 src/port/meson.build           |   1 +
 src/port/pg_bitutils.c         |  10 ++-
 src/port/pg_popcount_sve.c     | 160 +++++++++++++++++++++++++++++++++
 10 files changed, 322 insertions(+), 1 deletion(-)
 create mode 100644 src/port/pg_popcount_sve.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..c3c2d6fe29 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,39 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_POPCNT_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE popcount instructions using the
+# svdup_u64, svwhilelt_b8, svcntb, svaddv, svadd_x, svcnt_x, svld1,
+# svptrue_b64 and svand_x intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for svcnt_x and other intrinsics], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf@<:@sizeof(uint64_t)@:>@;
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }],
+  [return sve_popcount_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index ceeef9b091..4faf7def28 100755
--- a/configure
+++ b/configure
@@ -17168,6 +17168,62 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_x and other intrinsics" >&5
+$as_echo_n "checking for svcnt_x and other intrinsics... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf[sizeof(uint64_t)];
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }
+int
+main ()
+{
+return sve_popcount_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index d713360f34..ba069ebb29 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM popcount instructions.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 8e128f4982..d3c9a02abc 100644
--- a/meson.build
+++ b/meson.build
@@ -2194,6 +2194,39 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of ARM SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute(target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main ()
+{
+  int popcnt = 0;
+  const char buf[sizeof(uint64_t)];
+  svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+  svuint64_t accum = svdup_u64(0), vec;
+  if (svcntb() > 0)
+    popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+  vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+  accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+  popcnt += svaddv(pred64, accum);
+  return popcnt;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE popcount', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..29c32bbbbe 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 62554ce685..7d771a45dc 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,14 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On AArch64, try using SVE popcount instructions, but only if
+ * we can verify that the CPU supports it via a runtime check.
+ */
+#if defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+#define TRY_POPCNT_FAST 1
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
@@ -315,6 +323,12 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_sve_available(void);
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c22431951..61a8bcec15 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_popcount_sve.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d4..4a3429c21a 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_popcount_sve.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693..df7cf429c5 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -135,7 +135,9 @@ pg_popcount_available(void)
 {
 	unsigned int exx[4] = {0, 0, 0, 0};
 
-#if defined(HAVE__GET_CPUID)
+#if defined(__aarch64__)
+	return false;						/* cpuid not available in __aarch64__ */
+#elif defined(HAVE__GET_CPUID)
 	__get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
 #elif defined(HAVE__CPUID)
 	__cpuid(exx, 1);
@@ -176,6 +178,12 @@ choose_popcount_functions(void)
 		pg_popcount_optimized = pg_popcount_avx512;
 		pg_popcount_masked_optimized = pg_popcount_masked_avx512;
 	}
+#elif USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
 #endif
 }
 
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 0000000000..736fdfbf7f
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,160 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  Holds the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+#include "port/pg_bitutils.h"
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+
+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+#if defined(HAVE_ELF_AUX_INFO) && defined(__aarch64__)	/* FreeBSD */
+	unsigned long hwcap;
+	return elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) == 0 &&
+		(hwcap & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL) && defined(__aarch64__)	/* Linux */
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns the number of 1-bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svbool_t    pred;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64      popcnt = 0;
+	const char *aligned = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf);
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * therefore, align the buffer when the input size is sufficiently large.
+	 */
+	if (aligned != buf && bytes > 4 * vec_len)
+	{
+		pre_align = aligned + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		popcnt = svaddv(pred, svcnt_x(pred, svld1(pred, (const uint8 *) buf)));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svld1(pred, (const uint64 *) (buf + i));
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svld1(pred, (const uint64 *) (buf + i + vec_len));
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vector  */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		popcnt += svaddv(pred, svcnt_x(pred, svld1(pred, (const uint8 *) (buf + i))));
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns the number of 1-bits in buf after applying the mask
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	svbool_t	pred;
+	svuint8_t   vec8;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				pre_align,
+				loop_bytes;
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+	const char *aligned = (const char *) TYPEALIGN_DOWN(sizeof(uint64_t), buf);
+
+	/*
+	 * For smaller inputs, aligning the buffer degrades the performance.
+	 * therefore, align the buffer when the input size is sufficiently large.
+	 */
+	if (aligned != buf && bytes > 4 * vec_len)
+	{
+		pre_align = aligned + sizeof(uint64_t) - buf;
+		pred = svwhilelt_b8(0U, pre_align);
+		vec8 = svand_x(pred, svld1(pred, (const uint8 *) buf), mask);  /* load and mask */
+		popcnt = svaddv(pred, svcnt_x(pred, vec8));
+		buf += pre_align;
+		bytes -= pre_align;
+	}
+
+	pred = svptrue_b64();
+	loop_bytes = bytes & ~(vec_len * 2 - 1);
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vectors */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		vec8 = svand_x(pred, svld1(pred, (const uint8 *) (buf + i)), mask);
+		popcnt += svaddv(pred, svcnt_x(pred, vec8));
+	}
+
+	return popcnt;
+}
+
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.34.1

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#15)
Re: [PATCH] SVE popcount support

On Tue, Feb 04, 2025 at 09:01:33AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

+    /*
+     * For smaller inputs, aligning the buffer degrades the performance.
+     * Therefore, align the buffers only when the input size is sufficiently large.
+     */

Is the inverse true, i.e., does aligning the buffer improve performance for
larger inputs? I'm also curious what level of performance degradation you
were seeing.

Here is a comparison of all three cases. Alignment is marginally better for inputs
above 1024B, but the difference is small. Unaligned performs better for smaller inputs.
Aligned After 128B => the current implementation "if (aligned != buf && bytes > 4 * vec_len)"
Always Aligned => condition "bytes > 4 * vec_len" is removed.
Unaligned => the whole if block was removed

  buf  | Always Aligned | Aligned After 128B | Unaligned
-------+----------------+--------------------+-----------
    16 |         37.851 |             38.203 |    34.971
    32 |         37.859 |             38.187 |    34.972
    64 |         37.611 |             37.405 |    34.121
   128 |         45.357 |             45.897 |    41.890
   256 |         62.440 |             63.454 |    58.666
   512 |        100.120 |            102.767 |    99.861
  1024 |        159.574 |            158.594 |   164.975
  2048 |        282.354 |            281.198 |   283.937
  4096 |        532.038 |            531.068 |   533.699
  8192 |       1038.973 |           1038.083 |  1039.206
 16384 |       2028.604 |           2025.843 |  2033.940

Hm. These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment. Is there any reason not
to do that?

+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

--
nathan

#17Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#16)
Re: [PATCH] SVE popcount support

Hm. These results are so similar that I'm tempted to suggest we just
remove the section of code dedicated to alignment. Is there any reason not
to do that?

It seems that the double load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove it to simplify the code.

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

The unrolled version performs better than the non-unrolled one, but
processing four vectors provides no additional benefit. The numbers
and code used are given below.

 buf  | Not Unrolled | Unrolled x2 | Unrolled x4
------+--------------+-------------+-------------
   16 |        4.774 |       4.759 |       5.634
   32 |        6.872 |       6.486 |       7.348
   64 |       11.070 |      10.249 |      10.617
  128 |       20.003 |      16.205 |      16.764
  256 |       40.234 |      28.377 |      29.108
  512 |       83.825 |      53.420 |      53.658
 1024 |      191.181 |     101.677 |     102.727
 2048 |      389.160 |     200.291 |     201.544
 4096 |      785.742 |     404.593 |     399.134
 8192 |     1587.226 |     811.314 |     810.961

/* Process 4 vectors */
for (; i < loop_bytes; i += vec_len * 4)
{
      vec64_1 = svld1(pred, (const uint64 *) (buf + i));
      accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
      vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
      accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));

      vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
      accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
      vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
      accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}
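
For readers following along, here is a minimal portable sketch of the loop structure being compared: four independent accumulators, a loop bound rounded down to a multiple of four blocks, and a byte-wise tail. It is plain scalar C, not SVE. The names sketch_popcount8, sketch_count_block and sketch_popcount_unrolled4 are hypothetical, `block` merely stands in for vec_len (svcntb()), and the rounding of loop_bytes to 4 * block is an assumption, since the snippet above does not show how loop_bytes was computed for the 4-vector case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable popcount of one byte (clears the lowest set bit each step). */
static unsigned
sketch_popcount8(unsigned char b)
{
	unsigned	n = 0;

	while (b != 0)
	{
		b = (unsigned char) (b & (b - 1));
		n++;
	}
	return n;
}

/* Count the bits in one block; stands in for svcnt_x() on a full vector. */
static uint64_t
sketch_count_block(const unsigned char *p, size_t block)
{
	uint64_t	n = 0;

	for (size_t k = 0; k < block; k++)
		n += sketch_popcount8(p[k]);
	return n;
}

/*
 * Hypothetical helper: "block" plays the role of vec_len and must be a
 * power of two so that the mask below rounds down correctly.
 */
uint64_t
sketch_popcount_unrolled4(const char *buf, size_t bytes, size_t block)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	acc1 = 0, acc2 = 0, acc3 = 0, acc4 = 0;
	uint64_t	total;
	size_t		i = 0;
	size_t		loop_bytes;

	assert(block != 0 && (block & (block - 1)) == 0);

	/* Largest multiple of 4 * block that fits in the input. */
	loop_bytes = bytes & ~(block * 4 - 1);

	/* Four independent accumulators, one per block in the iteration. */
	for (; i < loop_bytes; i += block * 4)
	{
		acc1 += sketch_count_block(p + i, block);
		acc2 += sketch_count_block(p + i + block, block);
		acc3 += sketch_count_block(p + i + 2 * block, block);
		acc4 += sketch_count_block(p + i + 3 * block, block);
	}

	/* Reduce the accumulators once, after the loop. */
	total = acc1 + acc2 + acc3 + acc4;

	/* Tail: whatever did not fill a complete group of four blocks. */
	for (; i < bytes; i++)
		total += sketch_popcount8(p[i]);

	return total;
}
```

The point of separate accumulators is that each add depends only on its own previous value, so the four chains can overlap in the pipeline. Whether that helps in practice depends on the core, which is what the numbers above are probing.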

-Chiranmoy

#18Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#17)
Re: [PATCH] SVE popcount support

On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

The unrolled version performs better than the non-unrolled one, but
processing four vectors provides no additional benefit. The numbers
and code used are given below.

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

--
nathan

#19Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#18)
Re: [PATCH] SVE popcount support

On Thu, Feb 06, 2025 at 10:33:35AM -0600, Nathan Bossart wrote:

On Thu, Feb 06, 2025 at 08:44:35AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Does this hand-rolled loop unrolling offer any particular advantage? What
do the numbers look like if we don't do this or if we process, say, 4
vectors at a time?

The unrolled version performs better than the non-unrolled one, but
processing four vectors provides no additional benefit. The numbers
and code used are given below.

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

I've also noticed that the latest patch doesn't compile on my M3 macOS
machine. After a quick glance, I think the problem is that the
TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
versions.

../postgresql/src/port/pg_bitutils.c:230:41: error: invalid output constraint '=q' in asm
230 | __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
| ^
../postgresql/src/port/pg_bitutils.c:247:41: error: invalid output constraint '=q' in asm
247 | __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
| ^
2 errors generated.

--
nathan

#20Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#19)
1 attachment(s)
Re: [PATCH] SVE popcount support

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

Not sure, we tried forcing loop unrolling using the line below in the Makefile
but the results are the same.

pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native

I've also noticed that the latest patch doesn't compile on my M3 macOS
machine. After a quick glance, I think the problem is that the
TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
versions.

Fixed, we tried using the existing "choose" logic guarded by TRY_POPCNT_FAST.
The latest patch bypasses TRY_POPCNT_FAST by having a separate choose logic
for aarch64.
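
As a rough illustration of what a separate "choose" path looks like, here is a minimal self-contained sketch of the choose-on-first-call pattern. It is not the patch code: every sketch_* name is hypothetical, sketch_fast_available() stands in for pg_popcount_sve_available() and simply returns false, and sketch_popcount_fast() stands in for the SVE routine by reusing the slow path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*sketch_popcount_fn) (const char *buf, int bytes);

static uint64_t sketch_popcount_choose(const char *buf, int bytes);

/* Starts out pointing at the chooser; replaced on the first call. */
static sketch_popcount_fn sketch_popcount_impl = sketch_popcount_choose;

/* Hypothetical stand-in for the runtime CPU feature check. */
static bool
sketch_fast_available(void)
{
	return false;
}

/* Portable bit-by-bit fallback. */
static uint64_t
sketch_popcount_slow(const char *buf, int bytes)
{
	uint64_t	n = 0;

	assert(bytes >= 0);
	for (int i = 0; i < bytes; i++)
	{
		unsigned char b = (unsigned char) buf[i];

		for (; b != 0; b >>= 1)
			n += b & 1;
	}
	return n;
}

/* Hypothetical stand-in for the vectorized routine. */
static uint64_t
sketch_popcount_fast(const char *buf, int bytes)
{
	return sketch_popcount_slow(buf, bytes);
}

/* First call lands here: pick an implementation, then forward the call. */
static uint64_t
sketch_popcount_choose(const char *buf, int bytes)
{
	sketch_popcount_impl = sketch_fast_available() ?
		sketch_popcount_fast : sketch_popcount_slow;
	return sketch_popcount_impl(buf, bytes);
}

/* Public entry point: always goes through the pointer. */
uint64_t
sketch_popcount(const char *buf, int bytes)
{
	return sketch_popcount_impl(buf, bytes);
}
```

After the first call the pointer refers directly to the chosen routine, so the feature check runs only once. Every later call still pays for an indirect call, though, which matters when the chosen routine is the portable one.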

-Chiranmoy

Attachments:

v5-0001-SVE-support-for-popcount-and-popcount-masked.patch (application/octet-stream)
From 57668e64b861b9a722e30d452967376231841e59 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Wed, 19 Feb 2025 12:30:47 +0530
Subject: [PATCH v4] SVE support for popcount and popcount masked

---
 config/c-compiler.m4           |  36 ++++++++++
 configure                      |  56 +++++++++++++++
 configure.ac                   |   9 +++
 meson.build                    |  33 +++++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  13 ++++
 src/port/Makefile              |   1 +
 src/port/meson.build           |   1 +
 src/port/pg_bitutils.c         |  44 +++++++++++-
 src/port/pg_popcount_sve.c     | 123 +++++++++++++++++++++++++++++++++
 10 files changed, 318 insertions(+), 1 deletion(-)
 create mode 100644 src/port/pg_popcount_sve.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c13..c3c2d6fe29d 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,39 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_POPCNT_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE popcount instructions using the
+# svdup_u64, svwhilelt_b8, svcntb, svaddv, svadd_x, svcnt_x, svld1,
+# svptrue_b64 and svand_x intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for svcnt_x and other intrinsics], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf@<:@sizeof(uint64_t)@:>@;
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }],
+  [return sve_popcount_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 0ffcaeb4367..b227b826092 100755
--- a/configure
+++ b/configure
@@ -17049,6 +17049,62 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_x and other intrinsics" >&5
+$as_echo_n "checking for svcnt_x and other intrinsics... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf[sizeof(uint64_t)];
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }
+int
+main ()
+{
+return sve_popcount_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index f56681e0d91..5c649870550 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2016,6 +2016,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM popcount instructions.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 7dd7110318d..db3120442b5 100644
--- a/meson.build
+++ b/meson.build
@@ -2195,6 +2195,39 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of ARM SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute(target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main ()
+{
+  int popcnt = 0;
+  const char buf[sizeof(uint64_t)];
+  svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+  svuint64_t accum = svdup_u64(0), vec;
+  if (svcntb() > 0)
+    popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+  vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+  accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+  popcnt += svaddv(pred64, accum);
+  return popcnt;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE popcount', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..29c32bbbbe3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 62554ce685a..b73feb30da5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -315,6 +315,19 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+/*
+ * On AArch64, try using SVE popcount instructions, but only if
+ * we can verify that the CPU supports it via a runtime check.
+ */
+#elif USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern int	pg_popcount32(uint32 word);
+extern int	pg_popcount64(uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
+extern bool pg_popcount_sve_available(void);
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..61a8bcec15c 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_popcount_sve.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..4a3429c21a9 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_popcount_sve.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693d..f0d92bf317b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -339,6 +339,47 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 
 #endif							/* TRY_POPCNT_FAST */
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+static uint64 pg_popcount_choose_aarch64(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose_aarch64(const char *buf, int bytes, bits8 mask);
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose_aarch64;
+uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose_aarch64;
+
+/*
+ * On AArch64 these functions are invoked on the first call to pg_popcount and
+ * pg_popcount_masked. They detect whether we can use the SVE implementations,
+ * and replace the function pointers so that subsequent calls are routed
+ * directly to the chosen implementation.
+ */
+static inline void
+choose_popcount_functions_aarch64(void)
+{
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
+	else
+	{
+		pg_popcount_optimized = pg_popcount_slow;
+		pg_popcount_masked_optimized = pg_popcount_masked_slow;
+	}
+}
+
+static uint64
+pg_popcount_choose_aarch64(const char *buf, int bytes)
+{
+	choose_popcount_functions_aarch64();
+	return pg_popcount_optimized(buf, bytes);
+}
+
+static uint64
+pg_popcount_masked_choose_aarch64(const char *buf, int bytes, bits8 mask)
+{
+	choose_popcount_functions_aarch64();
+	return pg_popcount_masked(buf, bytes, mask);
+}
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
 
 /*
  * pg_popcount32_slow
@@ -507,6 +548,7 @@ pg_popcount64(uint64 word)
 	return pg_popcount64_slow(word);
 }
 
+#ifndef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
 /*
  * pg_popcount_optimized
  *		Returns the number of 1-bits in buf
@@ -526,5 +568,5 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 {
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
-
+#endif							/* !USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
 #endif							/* !TRY_POPCNT_FAST */
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 00000000000..060193eec11
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,123 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  Holds the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+#include "port/pg_bitutils.h"
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+
+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+#if defined(HAVE_ELF_AUX_INFO) && defined(__aarch64__)	/* FreeBSD */
+	unsigned long hwcap;
+	return elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) == 0 &&
+		(hwcap & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL) && defined(__aarch64__)	/* Linux */
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns the number of 1-bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svbool_t    pred = svptrue_b64();
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64      popcnt = 0;
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svld1(pred, (const uint64 *) (buf + i));
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svld1(pred, (const uint64 *) (buf + i + vec_len));
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vector  */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		popcnt += svaddv(pred, svcnt_x(pred, svld1(pred, (const uint8 *) (buf + i))));
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns the number of 1-bits in buf after applying the mask
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	svbool_t	pred = svptrue_b64();
+	svuint8_t   vec8;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vectors */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		vec8 = svand_x(pred, svld1(pred, (const uint8 *) (buf + i)), mask);
+		popcnt += svaddv(pred, svcnt_x(pred, vec8));
+	}
+
+	return popcnt;
+}
+
+#endif							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.34.1

#21Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#20)
Re: [PATCH] SVE popcount support

On Wed, Feb 19, 2025 at 09:31:50AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Hm. Any idea why that is? I wonder if the compiler isn't using as many
SVE registers as it could for this.

Not sure, we tried forcing loop unrolling using the line below in the Makefile
but the results are the same.

pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native

Interesting. I do see different assembly with the 2 and 4 register
versions, but I didn't get to testing it on a machine with SVE support
today.

Besides some additional benchmarking, I might make some small adjustments
to the patch. But overall, it seems to be in decent shape.

--
nathan

#22Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#21)
Re: [PATCH] SVE popcount support

Interesting. I do see different assembly with the 2 and 4 register
versions, but I didn't get to testing it on a machine with SVE support
today.

Besides some additional benchmarking, I might make some small adjustments
to the patch. But overall, it seems to be in decent shape.

Sounds good. Let us know your findings.

-Chiranmoy

#23Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#22)
Re: [PATCH] SVE popcount support

On Fri, Mar 07, 2025 at 03:20:07AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Sounds good. Let us know your findings.

Alright, here's what I saw on an R8g for drive_popcount(1000000, N):

8-byte words       master    v5-no-sve       v5-sve      v5-4reg
           1     2.540 ms     2.170 ms     1.807 ms     2.178 ms
           2     2.534 ms     2.180 ms     1.804 ms     2.167 ms
           4     3.988 ms     3.240 ms     1.590 ms     2.879 ms
           8     5.033 ms     4.672 ms     2.175 ms     2.525 ms
          16     8.252 ms    10.916 ms     3.235 ms     3.588 ms
          32    20.932 ms    22.883 ms     5.134 ms     5.395 ms
          64    40.446 ms    45.668 ms     9.817 ms     9.285 ms
         128    66.087 ms    91.386 ms    20.072 ms    17.175 ms
         256   153.852 ms   182.594 ms    40.447 ms    32.212 ms
         512   246.271 ms   300.941 ms    87.116 ms    60.729 ms
        1024   487.180 ms   607.289 ms   180.574 ms   116.948 ms
        2048   969.335 ms  1223.838 ms   363.595 ms   232.575 ms
        4096  1934.646 ms  2472.154 ms   729.525 ms   459.495 ms

(Note that there should be no need to test anything smaller than 8 bytes
because we use the inline version in pg_bitutils.h in that case.)
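
A minimal sketch of that split, assuming a wrapper shaped roughly like the one in pg_bitutils.h. The names popcount_sketch, popcount_out_of_line and popcount_byte are hypothetical, and the out-of-line path is a plain scalar stand-in for whatever the function pointer would select:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts the set bits in a single byte. */
static inline int
popcount_byte(unsigned char b)
{
	int			n = 0;

	for (; b; b &= (unsigned char) (b - 1))	/* clear the lowest set bit */
		n++;
	return n;
}

/* Hypothetical stand-in for the out-of-line (function pointer) path. */
static uint64_t
popcount_out_of_line(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	/* Consume whole 8-byte words first. */
	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* safe for unaligned input */
		for (; word; word &= word - 1)
			popcnt++;
	}
	/* Then any leftover bytes. */
	while (bytes-- > 0)
		popcnt += popcount_byte((unsigned char) *buf++);
	return popcnt;
}

/* Inputs shorter than 8 bytes never reach the out-of-line path. */
static inline uint64_t
popcount_sketch(const char *buf, int bytes)
{
	assert(bytes >= 0);
	if (bytes < 8)
	{
		uint64_t	popcnt = 0;

		while (bytes-- > 0)
			popcnt += popcount_byte((unsigned char) *buf++);
		return popcnt;
	}
	return popcount_out_of_line(buf, bytes);
}
```

With a wrapper like this, a benchmark at sizes below 8 bytes would only measure the inline loop, not the implementation chosen at runtime.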

v5-no-sve is the result of using a function pointer, but pointing to the
"slow" versions instead of the SVE version. v5-sve is the result of the
latest patch in this thread on a machine with SVE support, and v5-4reg is
the result of the latest patch in this thread modified to process 4
registers' worth of data at a time.
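
As a rough scalar analogue of processing several registers per iteration (this is not the SVE code from the patch), the sketch below keeps four independent accumulators so that consecutive additions do not depend on each other. popcount_4acc and popcount64_swar are hypothetical names, and whether the extra accumulators actually help depends on the hardware:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable 64-bit popcount (SWAR); hypothetical helper for this sketch. */
static inline uint64_t
popcount64_swar(uint64_t x)
{
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	return (x * UINT64_C(0x0101010101010101)) >> 56;
}

static uint64_t
popcount_4acc(const char *buf, size_t bytes)
{
	const size_t stride = 4 * sizeof(uint64_t);
	const size_t loop_bytes = bytes - (bytes % stride);
	uint64_t	acc0 = 0,
				acc1 = 0,
				acc2 = 0,
				acc3 = 0;
	uint64_t	popcnt;
	size_t		i = 0;

	assert(buf != NULL);

	/* Main loop: four words per iteration, one accumulator per word. */
	for (; i < loop_bytes; i += stride)
	{
		uint64_t	w[4];

		memcpy(w, buf + i, sizeof(w));
		acc0 += popcount64_swar(w[0]);
		acc1 += popcount64_swar(w[1]);
		acc2 += popcount64_swar(w[2]);
		acc3 += popcount64_swar(w[3]);
	}

	/* Reduce the accumulators, then finish the tail byte by byte. */
	popcnt = acc0 + acc1 + acc2 + acc3;
	for (; i < bytes; i++)
		popcnt += popcount64_swar((unsigned char) buf[i]);

	return popcnt;
}
```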

The biggest takeaways for me are as follows:

* The 4-register version does show some nice improvements as the data
grows.
* Machines without SVE will likely incur a rather sizable regression from
the newly introduced function pointer.

For the latter point, I think we should consider trying to add a separate
Neon implementation that we use as a fallback for machines that don't have
SVE. My understanding is that Neon is virtually universally supported on
64-bit Arm gear, so that will not only help offset the function pointer
overhead but may even improve performance for a much wider set of machines.
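
The function-pointer pattern under discussion can be sketched as follows. This is a simplified, portable illustration rather than the patch itself: popcount_impl, popcount_choose, fast_path_available and the two scalar implementations are hypothetical names, and the probe is hard-wired to false where real code would query the OS or CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Forward declaration so the pointer can start out at the chooser. */
static uint64_t popcount_choose(const char *buf, int bytes);

/* Every caller goes through this pointer (hypothetical name). */
static uint64_t (*popcount_impl) (const char *buf, int bytes) = popcount_choose;

/* Hypothetical capability probe; real code would ask the OS or CPU. */
static bool
fast_path_available(void)
{
	return false;
}

/* Baseline: one byte at a time. */
static uint64_t
popcount_bytewise(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	assert(bytes >= 0);
	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		for (; b; b &= (unsigned char) (b - 1))
			popcnt++;
	}
	return popcnt;
}

/* Stand-in for a faster variant: eight bytes per iteration. */
static uint64_t
popcount_wordwise(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		for (; word; word &= word - 1)
			popcnt++;
	}
	return popcnt + popcount_bytewise(buf, bytes);
}

/* Runs on the first call: picks an implementation, repoints, forwards. */
static uint64_t
popcount_choose(const char *buf, int bytes)
{
	popcount_impl = fast_path_available() ? popcount_wordwise : popcount_bytewise;
	return popcount_impl(buf, bytes);
}
```

After the first call the pointer targets the chosen implementation directly, but every call still pays for the indirect call, which is the overhead the v5-no-sve column appears to show.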

--
nathan

#24Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#23)
1 attachment(s)
Re: [PATCH] SVE popcount support

On Wed, Mar 12, 2025 at 02:41:18AM +0000, nathandbossart@gmail.com wrote:

v5-no-sve is the result of using a function pointer, but pointing to the
"slow" versions instead of the SVE version. v5-sve is the result of the
latest patch in this thread on a machine with SVE support, and v5-4reg is
the result of the latest patch in this thread modified to process 4
registers' worth of data at a time.

Nice, I wonder why I did not observe any performance gain in the 4reg
version. Did you modify the 4reg version code?

One possible explanation is that you used Graviton4 based instances
whereas I used Graviton3 instances.

For the latter point, I think we should consider trying to add a separate
Neon implementation that we use as a fallback for machines that don't have
SVE. My understanding is that Neon is virtually universally supported on
64-bit Arm gear, so that will not only help offset the function pointer
overhead but may even improve performance for a much wider set of machines.

I have added the NEON implementation in the latest patch.

Here are the numbers for drive_popcount(1000000, 1024) on m7g.8xlarge:
Scalar - 692ms
Neon - 298ms
SVE - 112ms
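
For readers unfamiliar with the Neon approach, here is a minimal single-accumulator sketch, not the attached patch. It counts bits per byte with vcntq_u8, widens the counts with pairwise adds, and reduces at the end. popcount_neon_sketch and popcount_tail are hypothetical names, and on non-AArch64 builds the guard makes the whole input go through the scalar tail so the sketch compiles anywhere:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#endif

/* Scalar tail: counts set bits one byte at a time. */
static uint64_t
popcount_tail(const unsigned char *p, size_t n)
{
	uint64_t	cnt = 0;

	for (size_t i = 0; i < n; i++)
		for (unsigned char b = p[i]; b; b &= (unsigned char) (b - 1))
			cnt++;
	return cnt;
}

static uint64_t
popcount_neon_sketch(const char *buf, size_t bytes)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	popcnt = 0;
	size_t		i = 0;

	assert(buf != NULL);

#ifdef SKETCH_HAVE_NEON
	{
		uint64x2_t	accum = vdupq_n_u64(0);

		/* One 16-byte vector per iteration. */
		for (; i + 16 <= bytes; i += 16)
		{
			/* Per-byte bit counts, each in the range 0..8. */
			uint8x16_t	cnt8 = vcntq_u8(vld1q_u8(p + i));

			/* Widen 8 -> 16 -> 32 bits, then accumulate into 64-bit lanes. */
			accum = vpadalq_u32(accum, vpaddlq_u16(vpaddlq_u8(cnt8)));
		}
		/* Sum the two 64-bit lanes. */
		popcnt = vaddvq_u64(accum);
	}
#endif

	/* Whatever the vector loop did not consume. */
	return popcnt + popcount_tail(p + i, bytes - i);
}
```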

-Chiranmoy

Attachments:

v6-0001-SVE-and-NEON-support-for-popcount.patch (application/octet-stream)
From e2830537bd5388e87ba7c4ebe61a156fcee17e4b Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Wed, 12 Mar 2025 15:21:36 +0530
Subject: [PATCH v6] SVE and NEON support for popcount

---
 config/c-compiler.m4           |  36 ++++++++++
 configure                      |  56 +++++++++++++++
 configure.ac                   |   9 +++
 meson.build                    |  33 +++++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  25 +++++++
 src/port/Makefile              |   2 +
 src/port/meson.build           |   2 +
 src/port/pg_bitutils.c         |  55 +++++++++++++-
 src/port/pg_popcount_neon.c    |  91 +++++++++++++++++++++++
 src/port/pg_popcount_sve.c     | 127 +++++++++++++++++++++++++++++++++
 11 files changed, 438 insertions(+), 1 deletion(-)
 create mode 100644 src/port/pg_popcount_neon.c
 create mode 100644 src/port/pg_popcount_sve.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c13..c3c2d6fe29d 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,39 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_POPCNT_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE popcount instructions using the
+# svdup_u64, svwhilelt_b8, svcntb, svaddv, svadd_x, svcnt_x, svld1,
+# svptrue_b64 and svand_x intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for svcnt_x and other intrinsics], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf@<:@sizeof(uint64_t)@:>@;
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }],
+  [return sve_popcount_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 0ffcaeb4367..b227b826092 100755
--- a/configure
+++ b/configure
@@ -17049,6 +17049,62 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_x and other intrinsics" >&5
+$as_echo_n "checking for svcnt_x and other intrinsics... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf[sizeof(uint64_t)];
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }
+int
+main ()
+{
+return sve_popcount_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index f56681e0d91..5c649870550 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2016,6 +2016,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM popcount instructions.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 7dd7110318d..db3120442b5 100644
--- a/meson.build
+++ b/meson.build
@@ -2195,6 +2195,39 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of ARM SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute(target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main ()
+{
+  int popcnt = 0;
+  const char buf[sizeof(uint64_t)];
+  svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+  svuint64_t accum = svdup_u64(0), vec;
+  if (svcntb() > 0)
+    popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+  vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+  accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+  popcnt += svaddv(pred64, accum);
+  return popcnt;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE popcount', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798abd..29c32bbbbe3 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 62554ce685a..ffafbc926af 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,16 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On aarch64, try using SVE popcount instructions, but only if
+ * we can verify that the CPU supports it via a runtime check.
+ *
+ * Otherwise, we fall back to the NEON implementation.
+ */
+#if defined(__aarch64__) && defined(__ARM_NEON)
+#define POPCNT_FAST_AARCH64 1
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
@@ -315,6 +325,21 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#elif POPCNT_FAST_AARCH64
+extern int	pg_popcount32(uint32 word);
+extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount_neon(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
+
+/* Attempt to use the SVE instructions, but perform a runtime check first */
+#if USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_sve_available(void);
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..9ea21fb6477 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,8 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_popcount_neon.o \
+	pg_popcount_sve.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7a0743ab233 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,8 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_popcount_neon.c',
+  'pg_popcount_sve.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693d..2afe76b2796 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -339,6 +339,59 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 
 #endif							/* TRY_POPCNT_FAST */
 
+#ifdef POPCNT_FAST_AARCH64
+static uint64 pg_popcount_choose_aarch64(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose_aarch64(const char *buf, int bytes, bits8 mask);
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose_aarch64;
+uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose_aarch64;
+
+/*
+ * On AArch64 these functions are invoked on the first call to pg_popcount and
+ * pg_popcount_masked. They detect whether we can use the SVE implementations,
+ * and replace the function pointers so that subsequent calls are routed
+ * directly to the chosen implementation.
+ */
+static inline void
+choose_popcount_functions_aarch64(void)
+{
+	pg_popcount_optimized = pg_popcount_neon;
+	pg_popcount_masked_optimized = pg_popcount_masked_neon;
+
+#if USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
+#endif
+}
+
+static uint64
+pg_popcount_choose_aarch64(const char *buf, int bytes)
+{
+	choose_popcount_functions_aarch64();
+	return pg_popcount_optimized(buf, bytes);
+}
+
+static uint64
+pg_popcount_masked_choose_aarch64(const char *buf, int bytes, bits8 mask)
+{
+	choose_popcount_functions_aarch64();
+	return pg_popcount_masked(buf, bytes, mask);
+}
+
+int
+pg_popcount32(uint32 word)
+{
+	return pg_popcount32_slow(word);
+}
+
+int
+pg_popcount64(uint64 word)
+{
+	return pg_popcount64_slow(word);
+}
+#endif							/* POPCNT_FAST_AARCH64 */
 
 /*
  * pg_popcount32_slow
@@ -486,7 +539,7 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef TRY_POPCNT_FAST
+#if !defined(TRY_POPCNT_FAST) && !defined(POPCNT_FAST_AARCH64)
 
 /*
  * When the POPCNT instruction is not available, there's no point in using
diff --git a/src/port/pg_popcount_neon.c b/src/port/pg_popcount_neon.c
new file mode 100644
index 00000000000..ee78fa01f26
--- /dev/null
+++ b/src/port/pg_popcount_neon.c
@@ -0,0 +1,91 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_neon.c
+ *	  Holds the NEON pg_popcount() implementation.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_neon.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef POPCNT_FAST_AARCH64
+
+#include <arm_neon.h>
+
+/*
+ * pg_popcount_neon
+ *		Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_neon(const char *buf, int bytes)
+{
+	uint8x16_t	vec8;
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0);
+	uint32		i = 0,
+				vec_len = sizeof(uint8x16_t),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64      popcnt = 0;
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec8 = vld1q_u8((const uint8 *) (buf + i));
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+		vec8 = vld1q_u8((const uint8 *) (buf + i + vec_len));
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+
+	/* Process any remaining bytes */
+	bytes -= loop_bytes;
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) buf[i++]];
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_neon
+ *		Returns the number of 1-bits in buf after applying the mask
+ */
+uint64
+pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask)
+{
+	uint8x16_t	vec8,
+				mask_vec = vdupq_n_u8(mask);
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0);
+	uint32		i = 0,
+				vec_len = sizeof(uint8x16_t),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64      popcnt = 0;
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec8 = vandq_u8(vld1q_u8((const uint8 *) (buf + i)), mask_vec);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+		vec8 = vandq_u8(vld1q_u8((const uint8 *) (buf + i + vec_len)), mask_vec);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+
+	/* Process any remaining bytes */
+	bytes -= loop_bytes;
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) buf[i++] & mask];
+
+	return popcnt;
+}
+
+#endif							/* POPCNT_FAST_AARCH64 */
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 00000000000..20bb20d9130
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,127 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  Holds the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+/*
+ * It's unlikely that USE_SVE_POPCNT_WITH_RUNTIME_CHECK is set and
+ * POPCNT_FAST_AARCH64 is not, but we check it anyway to be sure.
+ */
+#if defined(POPCNT_FAST_AARCH64) && defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+
+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+#if defined(HAVE_ELF_AUX_INFO) && defined(__aarch64__)	/* FreeBSD */
+	unsigned long hwcap;
+	return elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) == 0 &&
+		(hwcap & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL) && defined(__aarch64__)	/* Linux */
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns the number of 1-bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svbool_t    pred = svptrue_b64();
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64      popcnt = 0;
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svld1(pred, (const uint64 *) (buf + i));
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svld1(pred, (const uint64 *) (buf + i + vec_len));
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vector  */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		popcnt += svaddv(pred, svcnt_x(pred, svld1(pred, (const uint8 *) (buf + i))));
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns the number of 1-bits in buf after applying the mask
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	svbool_t	pred = svptrue_b64();
+	svuint8_t   vec8;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+
+	/* Process 2 complete vectors */
+	for (; i < loop_bytes; i += vec_len * 2)
+	{
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	}
+
+	/* Reduce the accumulators */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	/* Process the last incomplete vectors */
+	for(; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		vec8 = svand_x(pred, svld1(pred, (const uint8 *) (buf + i)), mask);
+		popcnt += svaddv(pred, svcnt_x(pred, vec8));
+	}
+
+	return popcnt;
+}
+
+#endif			/* POPCNT_FAST_AARCH64 && USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.34.1

#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#24)
Re: [PATCH] SVE popcount support

On Wed, Mar 12, 2025 at 10:34:46AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

On Wed, Mar 12, 2025 at 02:41:18AM +0000, nathandbossart@gmail.com wrote:

v5-no-sve is the result of using a function pointer, but pointing to the
"slow" versions instead of the SVE version. v5-sve is the result of the
latest patch in this thread on a machine with SVE support, and v5-4reg is
the result of the latest patch in this thread modified to process 4
registers' worth of data at a time.

Nice, I wonder why I did not observe any performance gain in the 4reg
version. Did you modify the 4reg version code?

One possible explanation is that you used Graviton4 based instances
whereas I used Graviton3 instances.

Yeah, it looks like the number of vector registers is different [0].

For the latter point, I think we should consider trying to add a separate
Neon implementation that we use as a fallback for machines that don't have
SVE. My understanding is that Neon is virtually universally supported on
64-bit Arm gear, so that will not only help offset the function pointer
overhead but may even improve performance for a much wider set of machines.

I have added the NEON implementation in the latest patch.

Here are the numbers for drive_popcount(1000000, 1024) on m7g.8xlarge:
Scalar - 692ms
Neon - 298ms
SVE - 112ms

Those are nice results. I'm a little worried about the Neon implementation
for smaller inputs since it uses a per-byte loop for the remaining bytes,
though. If we can ensure there's no regression there, I think this patch
will be in decent shape.

[0]: https://github.com/aws/aws-graviton-getting-started?tab=readme-ov-file#building-for-graviton

--
nathan

#26Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#25)
1 attachment(s)
Re: [PATCH] SVE popcount support

On Thu, Mar 13, 2025 at 12:02:07AM +0000, nathandbossart@gmail.com wrote:

Those are nice results. I'm a little worried about the Neon implementation
for smaller inputs since it uses a per-byte loop for the remaining bytes,
though. If we can ensure there's no regression there, I think this patch
will be in decent shape.

True, the Neon implementation in patch v6 did perform worse for smaller inputs.
This is fixed in v7: we now use pg_popcount64 to speed up the processing of
smaller inputs and any remaining bytes. Also, as with SVE, the neon-2reg version
performed better than neon-1reg, but neon-4reg showed no further improvement.
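
The idea of handling the remainder in 8-byte words before falling back to single bytes can be sketched like this. It is a scalar illustration, not the v7 code: popcount_remainder and popcount64_sketch are hypothetical names, with popcount64_sketch standing in for pg_popcount64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable 64-bit popcount (SWAR), standing in for pg_popcount64. */
static inline uint64_t
popcount64_sketch(uint64_t x)
{
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	x = (x & UINT64_C(0x3333333333333333)) +
		((x >> 2) & UINT64_C(0x3333333333333333));
	x = (x + (x >> 4)) & UINT64_C(0x0F0F0F0F0F0F0F0F);
	return (x * UINT64_C(0x0101010101010101)) >> 56;
}

/*
 * Counts the bits left over after a vector loop: whole 8-byte words first,
 * then at most 7 single bytes.
 */
static uint64_t
popcount_remainder(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	assert(bytes >= 0);

	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* safe for unaligned input */
		popcnt += popcount64_sketch(word);
	}

	while (bytes-- > 0)
		popcnt += popcount64_sketch((unsigned char) *buf++);

	return popcnt;
}
```

With a 32-byte vector stride, a 24-byte input never enters the vector loop. A per-byte tail then costs 24 iterations, whereas the word-wise tail above needs only 3.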

The table below compares patches v6 and v7 on m7g.4xlarge.
Query: SELECT drive_popcount(1000000, 8-byte words);
 8-byte words | master | v6-neon-2reg | v7-neon-2reg | v7-sve
--------------+--------+--------------+--------------+--------
            1 |  4.051 |        6.239 |        3.431 |  3.343
            2 |  4.429 |       10.773 |        3.899 |  3.335
            3 |  4.844 |       14.066 |        4.398 |  3.348
            4 |  5.324 |        3.342 |        3.663 |  3.365
            5 |  5.900 |        7.108 |        4.349 |  4.441
            6 |  6.478 |       11.720 |        4.851 |  4.441
            7 |  7.192 |       15.686 |        5.551 |  4.447
            8 |  8.016 |        4.288 |        4.367 |  4.013
We modified [0] to get the numbers for pg_popcount_masked:
 8-byte words |  master | v7-neon-2reg |  v7-sve
--------------+---------+--------------+---------
            1 |   4.289 |        4.202 |   3.827
            2 |   4.993 |        4.662 |   3.823
            3 |   5.981 |        5.459 |   3.834
            4 |   6.438 |        4.230 |   3.846
            5 |   7.169 |        5.236 |   5.072
            6 |   7.949 |        5.922 |   5.106
            7 |   9.130 |        6.535 |   5.060
            8 |   9.796 |        5.328 |   4.718
          512 | 387.543 |      182.801 |  77.077
         1024 | 760.644 |      360.660 | 150.519
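
For context, the masked variant counts bits after ANDing every byte with an 8-bit mask. The scalar sketch below shows the mask-broadcast expression the SVE code in the patch uses for its 64-bit lanes. popcount_masked_sketch is a hypothetical name and the loop is a plain scalar stand-in, not the Neon or SVE implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint64_t
popcount_masked_sketch(const char *buf, int bytes, uint8_t mask)
{
	/*
	 * ~0 / 0xFF is 0x0101010101010101; multiplying by the mask copies it
	 * into each of the eight bytes of the word.
	 */
	const uint64_t mask64 = ~UINT64_C(0) / 0xFF * mask;
	uint64_t	popcnt = 0;

	assert(bytes >= 0);

	/* Whole words: apply the broadcast mask, then count the set bits. */
	for (; bytes >= 8; bytes -= 8, buf += 8)
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));
		for (word &= mask64; word; word &= word - 1)
			popcnt++;
	}

	/* Leftover bytes: apply the 8-bit mask directly. */
	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) (*buf++ & mask);

		for (; b; b &= (unsigned char) (b - 1))
			popcnt++;
	}

	return popcnt;
}
```

Because the same mask lands in every byte position, the result does not depend on byte order.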

[0]: /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

-Chiranmoy

Attachments:

v7-0001-SVE-and-NEON-support-for-pg_popcount.patch (application/octet-stream)
From e40f2da5ef148627c15bd88dc1725ad391dbb5f9 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Wed, 19 Mar 2025 16:23:38 +0530
Subject: [PATCH v7] SVE and NEON support for pg_popcount

---
 config/c-compiler.m4           |  36 +++++++++
 configure                      |  56 +++++++++++++
 configure.ac                   |   9 +++
 meson.build                    |  33 ++++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  25 ++++++
 src/port/Makefile              |   2 +
 src/port/meson.build           |   2 +
 src/port/pg_bitutils.c         |  55 ++++++++++++-
 src/port/pg_popcount_neon.c    | 138 +++++++++++++++++++++++++++++++++
 src/port/pg_popcount_sve.c     | 133 +++++++++++++++++++++++++++++++
 11 files changed, 490 insertions(+), 2 deletions(-)
 create mode 100644 src/port/pg_popcount_neon.c
 create mode 100644 src/port/pg_popcount_sve.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c13..c3c2d6fe29d 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,39 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_POPCNT_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE popcount instructions using the
+# svdup_u64, svwhilelt_b8, svcntb, svaddv, svadd_x, svcnt_x, svld1,
+# svptrue_b64 and svand_x intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for svcnt_x and other intrinsics], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf@<:@sizeof(uint64_t)@:>@;
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }],
+  [return sve_popcount_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index 93fddd69981..100b499c9bb 100755
--- a/configure
+++ b/configure
@@ -17381,6 +17381,62 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_x and other intrinsics" >&5
+$as_echo_n "checking for svcnt_x and other intrinsics... " >&6; }
+if ${pgac_cv_arm_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute(target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int sve_popcount_test(void)
+    {
+      int popcnt = 0;
+      const char buf[sizeof(uint64_t)];
+      svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+      svuint64_t accum = svdup_u64(0), vec;
+      if (svcntb() > 0)
+        popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+      vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+      accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+      popcnt += svaddv(pred64, accum);
+      return popcnt;
+    }
+int
+main ()
+{
+return sve_popcount_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_arm_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_arm_sve_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..08e3fd9f6f9 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2057,6 +2057,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_arm_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM popcount instructions.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 13c13748e5d..bcfb05bd74d 100644
--- a/meson.build
+++ b/meson.build
@@ -2290,6 +2290,39 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of ARM SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute(target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main ()
+{
+  int popcnt = 0;
+  const char buf[sizeof(uint64_t)];
+  svbool_t pred8 = svwhilelt_b8(0, 8), pred64 = svptrue_b64();
+  svuint64_t accum = svdup_u64(0), vec;
+  if (svcntb() > 0)
+    popcnt = svaddv(pred8, svcnt_x(pred8, svld1(pred8, (const uint8_t *) buf)));
+  vec = svand_x(pred64, svld1(pred64, (const uint64_t *) buf), 0xf0f0);
+  accum = svadd_x(pred64, accum, svcnt_x(pred64, vec));
+  popcnt += svaddv(pred64, accum);
+  return popcnt;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE popcount', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..c7c08a907d8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -654,6 +654,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 62554ce685a..00d9e8a54ce 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,16 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On aarch64, try using SVE popcount instruction, but only if
+ * we can verify that the CPU supports it via a runtime check.
+ *
+ * Otherwise, we fall back to NEON implementation.
+ */
+#if defined(__aarch64__) && defined(__ARM_NEON)
+#define POPCNT_FAST_AARCH64 1
+#endif
+
 #ifdef TRY_POPCNT_FAST
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
@@ -315,6 +325,21 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#elif defined(POPCNT_FAST_AARCH64)
+extern int	pg_popcount32(uint32 word);
+extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount_neon(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask);
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
+
+/* Attempt to use the SVE instructions, but perform a runtime check first */
+#if USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern bool pg_popcount_sve_available(void);
+extern uint64 pg_popcount_sve(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..9ea21fb6477 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,8 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_popcount_avx512.o \
+	pg_popcount_neon.o \
+	pg_popcount_sve.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
 	pgmkdirp.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..7a0743ab233 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,8 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_popcount_avx512.c',
+  'pg_popcount_neon.c',
+  'pg_popcount_sve.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
   'pgmkdirp.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693d..86de39b6274 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,10 +103,16 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
+/*
+ * Since NEON is available on virtually all aarch64 machines, the
+ * hand-rolled scalar ("slow") implementations are unnecessary there.
+ */
+#ifndef POPCNT_FAST_AARCH64
 static inline int pg_popcount32_slow(uint32 word);
 static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
+#endif 							/* !POPCNT_FAST_AARCH64 */
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
@@ -339,6 +345,49 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 
 #endif							/* TRY_POPCNT_FAST */
 
+#ifdef POPCNT_FAST_AARCH64
+static uint64 pg_popcount_choose_aarch64(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose_aarch64(const char *buf, int bytes, bits8 mask);
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose_aarch64;
+uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose_aarch64;
+
+/*
+ * On AArch64 these functions are invoked on the first call to pg_popcount and
+ * pg_popcount_masked. They detect whether we can use the SVE implementations,
+ * and replace the function pointers so that subsequent calls are routed
+ * directly to the chosen implementation.
+ */
+static inline void
+choose_popcount_functions_aarch64(void)
+{
+	pg_popcount_optimized = pg_popcount_neon;
+	pg_popcount_masked_optimized = pg_popcount_masked_neon;
+
+#if USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
+#endif
+}
+
+static uint64
+pg_popcount_choose_aarch64(const char *buf, int bytes)
+{
+	choose_popcount_functions_aarch64();
+	return pg_popcount_optimized(buf, bytes);
+}
+
+static uint64
+pg_popcount_masked_choose_aarch64(const char *buf, int bytes, bits8 mask)
+{
+	choose_popcount_functions_aarch64();
+	return pg_popcount_masked(buf, bytes, mask);
+}
+#endif							/* POPCNT_FAST_AARCH64 */
+
+#ifndef POPCNT_FAST_AARCH64
 
 /*
  * pg_popcount32_slow
@@ -486,7 +535,9 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef TRY_POPCNT_FAST
+#endif 							/* !POPCNT_FAST_AARCH64 */
+
+#if !defined(TRY_POPCNT_FAST) && !defined(POPCNT_FAST_AARCH64)
 
 /*
  * When the POPCNT instruction is not available, there's no point in using
@@ -527,4 +578,4 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
 
-#endif							/* !TRY_POPCNT_FAST */
+#endif							/* !TRY_POPCNT_FAST && !POPCNT_FAST_AARCH64 */
diff --git a/src/port/pg_popcount_neon.c b/src/port/pg_popcount_neon.c
new file mode 100644
index 00000000000..32e359d4f4e
--- /dev/null
+++ b/src/port/pg_popcount_neon.c
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_neon.c
+ *	  Holds the NEON pg_popcount() implementation.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_neon.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+#ifdef POPCNT_FAST_AARCH64
+
+#include <arm_neon.h>
+
+/*
+ * pg_popcount32
+ *		Return the number of 1 bits set in a 32-bit word
+ */
+int
+pg_popcount32(uint32 word)
+{
+	return pg_popcount64(((uint64) word));
+}
+
+/*
+ * pg_popcount64
+ *		Return the number of 1 bits set in a 64-bit word
+ */
+int
+pg_popcount64(uint64 word)
+{
+	return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}
+
+/*
+ * pg_popcount_neon
+ *		Returns the number of 1-bits in buf
+ */
+uint64
+pg_popcount_neon(const char *buf, int bytes)
+{
+	uint8x16_t		vec8;
+	uint64x2_t		accum1 = vdupq_n_u64(0),
+					accum2 = vdupq_n_u64(0);
+	uint32			vec_len = sizeof(uint8x16_t),
+					loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64			popcnt = 0;
+	const uint64   *words;
+
+	if (loop_bytes)
+	{
+		/* Process two 16-byte vectors per iteration */
+		for (uint32 i = 0; i < loop_bytes; i += vec_len * 2)
+		{
+			vec8 = vld1q_u8((const uint8 *) (buf + i));
+			accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+			vec8 = vld1q_u8((const uint8 *) (buf + i + vec_len));
+			accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+		}
+
+		/* Reduce the accumulators */
+		popcnt = vaddvq_u64(vaddq_u64(accum1, accum2));
+		bytes -= loop_bytes;
+		buf += loop_bytes;
+	}
+
+	/* Process remaining 64-bit words */
+	words = (const uint64 *) buf;
+	buf += bytes & ~7;
+	while (bytes >= 8)
+	{
+		popcnt += pg_popcount64(*words++);
+		bytes -= 8;
+	}
+
+	/* Process any remaining bytes */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_neon
+ *		Returns the number of 1-bits in buf after applying the mask
+ */
+uint64
+pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask)
+{
+	uint8x16_t		vec8,
+					mask_vec = vdupq_n_u8(mask);
+	uint64x2_t		accum1 = vdupq_n_u64(0),
+					accum2 = vdupq_n_u64(0);
+	uint32			vec_len = sizeof(uint8x16_t),
+					loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64			popcnt = 0,
+					mask64 = ~UINT64CONST(0) / 0xFF * mask;
+	const uint64   *words;
+
+	if (loop_bytes)
+	{
+		/* Process two 16-byte vectors per iteration */
+		for (uint32 i = 0; i < loop_bytes; i += vec_len * 2)
+		{
+			vec8 = vandq_u8(vld1q_u8((const uint8 *) (buf + i)), mask_vec);
+			accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+			vec8 = vandq_u8(vld1q_u8((const uint8 *) (buf + i + vec_len)), mask_vec);
+			accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec8))));
+		}
+
+		/* Reduce the accumulators */
+		popcnt = vaddvq_u64(vaddq_u64(accum1, accum2));
+		bytes -= loop_bytes;
+		buf += loop_bytes;
+	}
+
+	/* Process remaining 64-bit words */
+	words = (const uint64 *) buf;
+	buf += bytes & ~7;
+	while (bytes >= 8)
+	{
+		popcnt += pg_popcount64(*words++ & mask64);
+		bytes -= 8;
+	}
+
+	/* Process any remaining bytes */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+	return popcnt;
+}
+
+#endif							/* POPCNT_FAST_AARCH64 */
diff --git a/src/port/pg_popcount_sve.c b/src/port/pg_popcount_sve.c
new file mode 100644
index 00000000000..0ecde505820
--- /dev/null
+++ b/src/port/pg_popcount_sve.c
@@ -0,0 +1,133 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_sve.c
+ *	  Holds the SVE pg_popcount() implementation.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_sve.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+#include "port/pg_bitutils.h"
+
+/*
+ * It's unlikely that USE_SVE_POPCNT_WITH_RUNTIME_CHECK is set and
+ * POPCNT_FAST_AARCH64 is not, but we check it anyway to be sure.
+ */
+#if defined(POPCNT_FAST_AARCH64) && defined(USE_SVE_POPCNT_WITH_RUNTIME_CHECK)
+
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+
+/*
+ * Returns true if the CPU supports the instructions required for the SVE
+ * pg_popcount() implementation.
+ */
+bool
+pg_popcount_sve_available(void)
+{
+#if defined(HAVE_ELF_AUX_INFO) && defined(__aarch64__)	/* FreeBSD */
+	unsigned long hwcap;
+	return elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) == 0 &&
+		(hwcap & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL) && defined(__aarch64__)	/* Linux */
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns the number of 1-bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	svbool_t    pred = svptrue_b64();
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64      popcnt = 0;
+
+	if (loop_bytes)
+	{
+		/* Process 2 complete vectors */
+		for (; i < loop_bytes; i += vec_len * 2)
+		{
+			vec64 = svld1(pred, (const uint64 *) (buf + i));
+			accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+			vec64 = svld1(pred, (const uint64 *) (buf + i + vec_len));
+			accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+		}
+
+		/* Reduce the accumulators */
+		popcnt = svaddv(pred, svadd_x(pred, accum1, accum2));
+	}
+
+	/* Process the remaining bytes, one (possibly partial) vector at a time */
+	for (; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		popcnt += svaddv(pred, svcnt_x(pred, svld1(pred, (const uint8 *) (buf + i))));
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns the number of 1-bits in buf after applying the mask
+ */
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	svbool_t	pred = svptrue_b64();
+	svuint8_t   vec8;
+	svuint64_t  vec64,
+				accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	uint32		i = 0,
+				vec_len = svcntb(),
+				loop_bytes = bytes & ~(vec_len * 2 - 1);
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+
+	if (loop_bytes)
+	{
+		/* Process 2 complete vectors */
+		for (; i < loop_bytes; i += vec_len * 2)
+		{
+			vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i)), mask64);
+			accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+			vec64 = svand_x(pred, svld1(pred, (const uint64 *) (buf + i + vec_len)), mask64);
+			accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+		}
+
+		/* Reduce the accumulators */
+		popcnt = svaddv(pred, svadd_x(pred, accum1, accum2));
+	}
+
+	/* Process the remaining bytes, one (possibly partial) vector at a time */
+	for (; i < bytes; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, (uint32) bytes);
+		vec8 = svand_x(pred, svld1(pred, (const uint8 *) (buf + i)), mask);
+		popcnt += svaddv(pred, svcnt_x(pred, vec8));
+	}
+
+	return popcnt;
+}
+
+#endif			/* POPCNT_FAST_AARCH64 && USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.34.1
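
For illustration only (this is not code from the patch above), the choose-on-first-call dispatch that choose_popcount_functions_aarch64() performs can be sketched in portable C. The names popcount_impl, popcount_accelerated, and accelerated_available are hypothetical stand-ins; in the patch the probe is a HWCAP_SVE test and the fast path is the SVE routine.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: count bits one byte at a time. */
static uint64_t
popcount_portable(const char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	while (bytes-- > 0)
	{
		unsigned char b = (unsigned char) *buf++;

		/* Clearing the lowest set bit once per iteration counts the 1 bits. */
		for (; b != 0; b = (unsigned char) (b & (b - 1)))
			popcnt++;
	}
	return popcnt;
}

/* Hypothetical stand-in for a faster, CPU-specific implementation. */
static uint64_t
popcount_accelerated(const char *buf, int bytes)
{
	return popcount_portable(buf, bytes);
}

/* Hypothetical stand-in for a runtime probe (assumed to fail here). */
static bool
accelerated_available(void)
{
	return false;
}

static uint64_t popcount_choose(const char *buf, int bytes);

/* Starts out pointing at the chooser; replaced on the first call. */
static uint64_t (*popcount_impl) (const char *buf, int bytes) = popcount_choose;

static uint64_t
popcount_choose(const char *buf, int bytes)
{
	/* Pick an implementation once, then forward this first request to it. */
	popcount_impl = accelerated_available() ? popcount_accelerated : popcount_portable;
	return popcount_impl(buf, bytes);
}
```

Callers always go through popcount_impl, so only the first call pays for the probe; every later call is a single indirect jump to the chosen routine.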

#27Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#26)
3 attachment(s)
Re: [PATCH] SVE popcount support

I've been preparing these for commit, and I've attached what I have so far.
A few notes:

* 0001 just renames the TRY_POPCNT_FAST macro to indicate that it's
x86_64-specific. IMO this is worth doing independent of this patch set,
but it's more important with the patch set since we need something
similar for Aarch64. I think we should also consider moving the x86_64
stuff to its own file (perhaps combining it with the AVX-512 stuff), but
that can probably wait until later.

* 0002 introduces the Neon implementation, which conveniently doesn't need
configure-time checks or function pointers. I noticed that some
compilers (e.g., Apple clang 16) compile in Neon instructions already,
but our hand-rolled implementation is better about instruction-level
parallelism and seems to still be quite a bit faster.

* 0003 introduces the SVE implementation. You'll notice I've moved all the
function pointer gymnastics into the pg_popcount_aarch64.c file, which is
where the Neon implementations live, too. I also tried to clean up the
configure checks a bit. I imagine it's possible to make them more
compact, but I felt that the enhanced readability was worth it.

* For both Neon and SVE, I do see improvements with looping over 4
registers at a time, so IMHO it's worth doing so even if it performs the
same as 2-register blocks on some hardware. I did add a 2-register block
in the Neon implementation for processing the tail because I was worried
about its performance on smaller buffers, but that part might get removed
if I can't measure any difference.
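
To make the block structure described in the last point concrete, here is a minimal portable sketch, assuming 64-bit words as stand-ins for vector registers. It is not the Neon or SVE code from the attached patches; popcount64, load64, and popcount_blocks are hypothetical names, and the bit-twiddling popcount replaces the vector count instruction.

```c
#include <stdint.h>
#include <string.h>

/* Portable 64-bit popcount (stand-in for a vector count instruction). */
static inline uint64_t
popcount64(uint64_t x)
{
	x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
	x = (x & UINT64_C(0x3333333333333333)) + ((x >> 2) & UINT64_C(0x3333333333333333));
	x = (x + (x >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f);
	return (x * UINT64_C(0x0101010101010101)) >> 56;
}

/* Load 8 bytes without assuming any alignment of p. */
static inline uint64_t
load64(const char *p)
{
	uint64_t	w;

	memcpy(&w, p, sizeof(w));
	return w;
}

static uint64_t
popcount_blocks(const char *buf, int bytes)
{
	uint64_t	accum1 = 0, accum2 = 0, accum3 = 0, accum4 = 0;
	uint64_t	popcnt;

	/* Main loop: four independent accumulators, so the adds can overlap. */
	for (; bytes >= 32; bytes -= 32, buf += 32)
	{
		accum1 += popcount64(load64(buf));
		accum2 += popcount64(load64(buf + 8));
		accum3 += popcount64(load64(buf + 16));
		accum4 += popcount64(load64(buf + 24));
	}

	/* At most one two-word block can remain after the main loop. */
	if (bytes >= 16)
	{
		accum1 += popcount64(load64(buf));
		accum2 += popcount64(load64(buf + 8));
		buf += 16;
		bytes -= 16;
	}

	/* Reduce the accumulators. */
	popcnt = (accum1 + accum2) + (accum3 + accum4);

	/* At most one whole word can remain. */
	for (; bytes >= 8; bytes -= 8, buf += 8)
		popcnt += popcount64(load64(buf));

	/* Trailing bytes. */
	while (bytes-- > 0)
		popcnt += popcount64((unsigned char) *buf++);

	return popcnt;
}
```

Because no accumulator depends on another inside the main loop, a superscalar core can keep several count-and-add chains in flight at once. Whether four chains beat two depends on the hardware, which is why it has to be measured.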

I'm planning to run several more benchmarks, but everything I've seen thus
far has looked pretty good.

--
nathan

Attachments:

v8-0001-Rename-TRY_POPCNT_FAST-to-POPCNT_X86_64.patch (text/plain; charset=us-ascii)
From c14a62c26196731aa2379babf535e698260f0066 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 21 Mar 2025 09:47:30 -0500
Subject: [PATCH v8 1/3] Rename TRY_POPCNT_FAST to POPCNT_X86_64.

---
 src/include/port/pg_bitutils.h |  6 +++---
 src/port/pg_bitutils.c         | 14 +++++++-------
 src/port/pg_popcount_avx512.c  |  8 ++++----
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 62554ce685a..70bf65c04e4 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -294,11 +294,11 @@ pg_ceil_log2_64(uint64 num)
  */
 #ifdef HAVE_X86_64_POPCNTQ
 #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
-#define TRY_POPCNT_FAST 1
+#define POPCNT_X86_64 1
 #endif
 #endif
 
-#ifdef TRY_POPCNT_FAST
+#ifdef POPCNT_X86_64
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
@@ -322,7 +322,7 @@ extern int	pg_popcount64(uint64 word);
 extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* POPCNT_X86_64 */
 
 /*
  * Returns the number of 1-bits in buf.
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693d..34904c2fbd9 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -108,7 +108,7 @@ static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
-#ifdef TRY_POPCNT_FAST
+#ifdef POPCNT_X86_64
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
@@ -123,9 +123,9 @@ int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* POPCNT_X86_64 */
 
-#ifdef TRY_POPCNT_FAST
+#ifdef POPCNT_X86_64
 
 /*
  * Return true if CPUID indicates that the POPCNT instruction is available.
@@ -337,7 +337,7 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* POPCNT_X86_64 */
 
 
 /*
@@ -486,13 +486,13 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef TRY_POPCNT_FAST
+#ifndef POPCNT_X86_64
 
 /*
  * When the POPCNT instruction is not available, there's no point in using
  * function pointers to vary the implementation between the fast and slow
  * method.  We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined.  The compiler should be able to inline
+ * POPCNT_X86_64 is not defined.  The compiler should be able to inline
  * the slow versions here.
  */
 int
@@ -527,4 +527,4 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
 
-#endif							/* !TRY_POPCNT_FAST */
+#endif							/* !POPCNT_X86_64 */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dac895a0fc2..63f697ebea8 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -27,11 +27,11 @@
 #include "port/pg_bitutils.h"
 
 /*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * It's probably unlikely that POPCNT_X86_64 won't be set if we are able to
  * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ * the function pointers that are only used when POPCNT_X86_64 is set.
  */
-#ifdef TRY_POPCNT_FAST
+#ifdef POPCNT_X86_64
 
 /*
  * Does CPUID say there's support for XSAVE instructions?
@@ -219,5 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 	return _mm512_reduce_add_epi64(accum);
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* POPCNT_X86_64 */
 #endif							/* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.39.5 (Apple Git-154)

v8-0002-Neon-popcount-support.patch (text/plain; charset=us-ascii)
From 3ebc1321e6782919980d3410d3bc527fd77751fc Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 21 Mar 2025 11:04:26 -0500
Subject: [PATCH v8 2/3] Neon popcount support.

---
 src/include/port/pg_bitutils.h |   9 ++
 src/port/Makefile              |   1 +
 src/port/meson.build           |   1 +
 src/port/pg_bitutils.c         |  22 +++-
 src/port/pg_popcount_aarch64.c | 203 +++++++++++++++++++++++++++++++++
 5 files changed, 230 insertions(+), 6 deletions(-)
 create mode 100644 src/port/pg_popcount_aarch64.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 70bf65c04e4..9aa07e5d574 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,15 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On AArch64, we can use Neon instructions if the compiler provides access to
+ * them (as indicated by __ARM_NEON).  As in simd.h, we assume that all
+ * available 64-bit hardware has Neon support.
+ */
+#if defined(__aarch64__) && defined(__ARM_NEON)
+#define POPCNT_AARCH64 1
+#endif
+
 #ifdef POPCNT_X86_64
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..cb86b7141e6 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_popcount_aarch64.o \
 	pg_popcount_avx512.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..cad0dd8f4f8 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_aarch64.c',
   'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 34904c2fbd9..8b6f20b54e9 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,10 +103,15 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
+/*
+ * If we are building the Neon versions, we don't need the "slow" fallbacks.
+ */
+#ifndef POPCNT_AARCH64
 static inline int pg_popcount32_slow(uint32 word);
 static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
+#endif
 
 #ifdef POPCNT_X86_64
 static bool pg_popcount_available(void);
@@ -339,6 +344,10 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 
 #endif							/* POPCNT_X86_64 */
 
+/*
+ * If we are building the Neon versions, we don't need the "slow" fallbacks.
+ */
+#ifndef POPCNT_AARCH64
 
 /*
  * pg_popcount32_slow
@@ -486,14 +495,15 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef POPCNT_X86_64
+#endif							/* ! POPCNT_AARCH64 */
+
+#if !defined(POPCNT_X86_64) && !defined(POPCNT_AARCH64)
 
 /*
- * When the POPCNT instruction is not available, there's no point in using
+ * When special CPU instructions are not available, there's no point in using
  * function pointers to vary the implementation between the fast and slow
- * method.  We instead just make these actual external functions when
- * POPCNT_X86_64 is not defined.  The compiler should be able to inline
- * the slow versions here.
+ * method.  We instead just make these actual external functions.  The compiler
+ * should be able to inline the slow versions here.
  */
 int
 pg_popcount32(uint32 word)
@@ -527,4 +537,4 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
 
-#endif							/* !POPCNT_X86_64 */
+#endif							/* ! POPCNT_X86_64 && ! POPCNT_AARCH64 */
diff --git a/src/port/pg_popcount_aarch64.c b/src/port/pg_popcount_aarch64.c
new file mode 100644
index 00000000000..426bae660ef
--- /dev/null
+++ b/src/port/pg_popcount_aarch64.c
@@ -0,0 +1,203 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_aarch64.c
+ *	  Holds the AArch64 pg_popcount() implementations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_aarch64.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+#ifdef POPCNT_AARCH64
+
+#include <arm_neon.h>
+
+/*
+ * pg_popcount32
+ *		Return number of 1 bits in word
+ */
+int
+pg_popcount32(uint32 word)
+{
+	return pg_popcount64((uint64) word);
+}
+
+/*
+ * pg_popcount64
+ *		Return number of 1 bits in word
+ */
+int
+pg_popcount64(uint64 word)
+{
+	return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}
+
+/*
+ * pg_popcount_optimized
+ *		Returns number of 1 bits in buf
+ */
+uint64
+pg_popcount_optimized(const char *buf, int bytes)
+{
+	uint8x16_t	vec;
+	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0),
+				accum3 = vdupq_n_u64(0),
+				accum4 = vdupq_n_u64(0);
+	uint64		popcnt = 0;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		vec = vld1q_u8((const uint8 *) buf);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum3 = vpadalq_u32(accum3, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum4 = vpadalq_u32(accum4, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * sizeof(uint8x16_t);
+	if (bytes >= bytes_per_iteration)
+	{
+		vec = vld1q_u8((const uint8 *) buf);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+	popcnt += vaddvq_u64(vaddq_u64(accum3, accum4));
+
+	/*
+	 * Process remaining 8-byte blocks.
+	 */
+	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+	{
+		popcnt += pg_popcount64(*((uint64 *) buf));
+		buf += sizeof(uint64);
+	}
+
+	/*
+	 * Process any remaining data byte-by-byte.
+	 */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_optimized
+ *		Returns number of 1 bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+	uint8x16_t	vec;
+	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0),
+				accum3 = vdupq_n_u64(0),
+				accum4 = vdupq_n_u64(0);
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+	uint8x16_t	maskv = vdupq_n_u8(mask);
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum3 = vpadalq_u32(accum3, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum4 = vpadalq_u32(accum4, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * sizeof(uint8x16_t);
+	if (bytes >= bytes_per_iteration)
+	{
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+	popcnt += vaddvq_u64(vaddq_u64(accum3, accum4));
+
+	/*
+	 * Process remaining 8-byte blocks.
+	 */
+	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+	{
+		popcnt += pg_popcount64(*((uint64 *) buf) & mask64);
+		buf += sizeof(uint64);
+	}
+
+	/*
+	 * Process any remaining data byte-by-byte.
+	 */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+	return popcnt;
+}
+
+#endif							/* POPCNT_AARCH64 */
-- 
2.39.5 (Apple Git-154)
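
As a side note on the masked variant in the patch above, the expression `~UINT64CONST(0) / 0xFF * mask` replicates the 8-bit mask into every byte of a 64-bit word. A small standalone sketch of that arithmetic follows; it uses the standard UINT64_C macro in place of PostgreSQL's UINT64CONST, and broadcast_mask is a hypothetical helper name.

```c
#include <stdint.h>

/* Replicate an 8-bit mask into every byte of a 64-bit word. */
static inline uint64_t
broadcast_mask(uint8_t mask)
{
	/*
	 * ~0 / 0xFF is 0x0101010101010101.  Multiplying that by a value below
	 * 256 copies the value into each byte lane without any carries.
	 */
	return ~UINT64_C(0) / 0xFF * mask;
}
```

With the mask broadcast once up front, the 8-byte tail loop can AND a whole word at a time instead of masking byte by byte.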

v8-0003-SVE-popcount-support.patch (text/plain; charset=us-ascii)
From 36f954a5735911af3e057f24d8803c32819e738d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 21 Mar 2025 20:24:44 -0500
Subject: [PATCH v8 3/3] SVE popcount support.

---
 config/c-compiler.m4           |  64 +++++++++
 configure                      |  84 ++++++++++++
 configure.ac                   |   9 ++
 meson.build                    |  61 +++++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  17 +++
 src/port/pg_popcount_aarch64.c | 235 ++++++++++++++++++++++++++++++++-
 7 files changed, 467 insertions(+), 6 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3712e81e38c..d1e7461f6f6 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -708,3 +708,67 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_SVE_POPCNT_INTRINSICS
+# --------------------------
+# Check if the compiler supports the SVE popcount instructions using the
+# svptrue_b64, svdup_u64, svcntb, svld1, svadd_x, svcnt_x, svaddv,
+# svwhilelt_b8, and svand_x intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sve_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for svcnt_x], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <arm_sve.h>
+
+	char buf[500];
+
+	#if defined(__has_attribute) && __has_attribute (target)
+	__attribute__((target("arch=armv8-a+sve")))
+	#endif
+	static int popcount_test(void)
+	{
+		uint32_t	vec_len = svcntb();
+		int			bytes = sizeof(buf);
+		svuint64_t	accum1 = svdup_u64(0),
+					accum2 = svdup_u64(0);
+		svbool_t	pred = svptrue_b64();
+		uint64_t	popcnt = 0,
+					mask = 0x5555555555555555;
+		char	   *p = buf;
+
+		for (; bytes >= vec_len * 2; bytes -= vec_len * 2)
+		{
+			svuint64_t  vec;
+
+			vec = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+			accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+			p += vec_len;
+
+			vec = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+			accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+			p += vec_len;
+		}
+
+		popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+		for (; bytes >= vec_len; bytes -= vec_len)
+		{
+			svuint8_t   vec;
+
+			pred = svwhilelt_b8(0, bytes);
+			vec = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+			popcnt += svaddv(pred, svcnt_x(pred, vec));
+			p += vec_len;
+		}
+
+		return (int) popcnt;
+	}]],
+  [return popcount_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index fac1e9a4e39..85f4b24caaa 100755
--- a/configure
+++ b/configure
@@ -17378,6 +17378,90 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_x" >&5
+$as_echo_n "checking for svcnt_x... " >&6; }
+if ${pgac_cv_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+	char buf[500];
+
+	#if defined(__has_attribute) && __has_attribute (target)
+	__attribute__((target("arch=armv8-a+sve")))
+	#endif
+	static int popcount_test(void)
+	{
+		uint32_t	vec_len = svcntb();
+		int			bytes = sizeof(buf);
+		svuint64_t	accum1 = svdup_u64(0),
+					accum2 = svdup_u64(0);
+		svbool_t	pred = svptrue_b64();
+		uint64_t	popcnt = 0,
+					mask = 0x5555555555555555;
+		char	   *p = buf;
+
+		for (; bytes >= vec_len * 2; bytes -= vec_len * 2)
+		{
+			svuint64_t  vec;
+
+			vec = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+			accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+			p += vec_len;
+
+			vec = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+			accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+			p += vec_len;
+		}
+
+		popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+		for (; bytes >= vec_len; bytes -= vec_len)
+		{
+			svuint8_t   vec;
+
+			pred = svwhilelt_b8(0, bytes);
+			vec = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+			popcnt += svaddv(pred, svcnt_x(pred, vec));
+			p += vec_len;
+		}
+
+		return (int) popcnt;
+	}
+int
+main ()
+{
+return popcount_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_sve_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..64b52940658 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2057,6 +2057,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 7cf518a2765..de7e695ab6f 100644
--- a/meson.build
+++ b/meson.build
@@ -2285,6 +2285,67 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+char buf[500];
+
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+	uint32_t	vec_len = svcntb();
+	int			bytes = sizeof(buf);
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0);
+	svbool_t	pred = svptrue_b64();
+	uint64_t	popcnt = 0,
+				mask = 0x5555555555555555;
+	char	   *p = buf;
+
+	for (; bytes >= vec_len * 2; bytes -= vec_len * 2)
+	{
+		svuint64_t	vec;
+
+		vec = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		p += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		p += vec_len;
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	for (; bytes >= vec_len; bytes -= vec_len)
+	{
+		svuint8_t 	vec;
+
+		pred = svwhilelt_b8(0, bytes);
+		vec = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+		popcnt += svaddv(pred, svcnt_x(pred, vec));
+		p += vec_len;
+	}
+
+	return (int) popcnt;
+}
+'''
+
+  if cc.links(prog, name: 'SVE popcount', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..2a67db077a9 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -706,6 +706,9 @@
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to build with systemd support. (--with-systemd) */
 #undef USE_SYSTEMD
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 9aa07e5d574..1bcb4ecb8ab 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -324,6 +324,23 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#elif POPCNT_AARCH64
+/* Use the Neon version of pg_popcount{32,64} without function pointer. */
+extern int	pg_popcount32(uint32 word);
+extern int	pg_popcount64(uint64 word);
+
+/*
+ * We can try to use an SVE-optimized pg_popcount() on some systems.  For that,
+ * we do use a function pointer.
+ */
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
+#else
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/pg_popcount_aarch64.c b/src/port/pg_popcount_aarch64.c
index 426bae660ef..48441269639 100644
--- a/src/port/pg_popcount_aarch64.c
+++ b/src/port/pg_popcount_aarch64.c
@@ -18,6 +18,229 @@
 
 #include <arm_neon.h>
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+#endif
+
+/*
+ * The Neon versions are built regardless of whether we are building the SVE
+ * versions.
+ */
+static uint64 pg_popcount_neon(const char *buf, int bytes);
+static uint64 pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask);
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+/*
+ * These are the SVE implementations of the popcount functions.
+ */
+static uint64 pg_popcount_sve(const char *buf, int bytes);
+static uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+
+/*
+ * The function pointers are initially set to "choose" functions.  These
+ * functions will first set the pointers to the right implementations (based on
+ * what the current CPU supports) and then will call the pointer to fulfill the
+ * caller's request.
+ */
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
+
+static inline bool
+pg_popcount_sve_available(void)
+{
+#ifdef HAVE_ELF_AUX_INFO
+	unsigned long value;
+
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL)
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_popcount_functions(void)
+{
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
+	else
+	{
+		pg_popcount_optimized = pg_popcount_neon;
+		pg_popcount_masked_optimized = pg_popcount_masked_neon;
+	}
+}
+
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	choose_popcount_functions();
+	return pg_popcount_optimized(buf, bytes);
+}
+
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+	choose_popcount_functions();
+	return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns number of 1 bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	uint32		vec_len = svcntb(),
+				bytes_per_iteration = 4 * vec_len;
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				accum3 = svdup_u64(0),
+				accum4 = svdup_u64(0);
+	svbool_t	pred = svptrue_b64();
+	uint64		popcnt = 0;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+	popcnt += svaddv(pred, svadd_x(pred, accum3, accum4));
+
+	/*
+	 * Process any remaining data.
+	 */
+	for (; bytes >= vec_len; bytes -= vec_len)
+	{
+		svuint8_t	vec;
+
+		pred = svwhilelt_b8(0, bytes);
+		vec = svld1(pred, (const uint8 *) buf);
+		popcnt += svaddv(pred, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns number of 1 bits in buf after applying the mask to each byte
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	uint32		vec_len = svcntb(),
+				bytes_per_iteration = 4 * vec_len;
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				accum3 = svdup_u64(0),
+				accum4 = svdup_u64(0);
+	svbool_t	pred = svptrue_b64();
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+	popcnt += svaddv(pred, svadd_x(pred, accum3, accum4));
+
+	/*
+	 * Process any remaining data.
+	 */
+	for (; bytes >= vec_len; bytes -= vec_len)
+	{
+		svuint8_t	vec;
+
+		pred = svwhilelt_b8(0, bytes);
+		vec = svand_x(pred, svld1(pred, (const uint8 *) buf), mask);
+		popcnt += svaddv(pred, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	return popcnt;
+}
+
+#else							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
+
+/*
+ * When the SVE version isn't available, there's no point in using function
+ * pointers to vary the implementation.  We instead just make these actual
+ * external functions when USE_SVE_POPCNT_WITH_RUNTIME_CHECK is not defined.
+ * The compiler should be able to inline the slow versions here.
+ */
+uint64
+pg_popcount_optimized(const char *buf, int bytes)
+{
+	return pg_popcount_neon(buf, bytes);
+}
+
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+	return pg_popcount_masked_neon(buf, bytes, mask);
+}
+
+#endif							/* ! USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
+
 /*
  * pg_popcount32
  *		Return number of 1 bits in word
@@ -39,11 +262,11 @@ pg_popcount64(uint64 word)
 }
 
 /*
- * pg_popcount_optimized
+ * pg_popcount_neon
  *		Returns number of 1 bits in buf
  */
-uint64
-pg_popcount_optimized(const char *buf, int bytes)
+static uint64
+pg_popcount_neon(const char *buf, int bytes)
 {
 	uint8x16_t	vec;
 	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
@@ -119,11 +342,11 @@ pg_popcount_optimized(const char *buf, int bytes)
 }
 
 /*
- * pg_popcount_masked_optimized
+ * pg_popcount_masked_neon
  *		Returns number of 1 bits in buf after applying the mask to each byte
  */
-uint64
-pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+static uint64
+pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask)
 {
 	uint8x16_t	vec;
 	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
-- 
2.39.5 (Apple Git-154)
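
For readers following the masked variant in the patch above: the expression `~UINT64CONST(0) / 0xFF * mask` replicates the one-byte mask into every byte of a 64-bit word, because dividing all-ones by 0xFF gives 0x0101010101010101. Below is a minimal portable sketch of that idea in plain C rather than SVE. The helper names are made up for illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: copy an 8-bit mask into all eight bytes of a word. */
static uint64_t
broadcast_mask(uint8_t mask)
{
    /* ~0 / 0xFF == 0x0101010101010101, so the product repeats the byte. */
    return ~UINT64_C(0) / 0xFF * mask;
}

/* Hypothetical reference: masked popcount, one byte at a time. */
static uint64_t
popcount_masked_bytewise(const unsigned char *buf, int bytes, uint8_t mask)
{
    uint64_t    popcnt = 0;

    assert(bytes >= 0);
    for (int i = 0; i < bytes; i++)
    {
        unsigned char b = (unsigned char) (buf[i] & mask);

        /* Clear the lowest set bit until none remain. */
        for (; b != 0; b &= (unsigned char) (b - 1))
            popcnt++;
    }
    return popcnt;
}
```

Applying the broadcast word to eight bytes at once should give the same count as applying the byte mask to each of those bytes separately, which is what lets the main loop work on 64-bit lanes.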

#28Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#27)
Re: [PATCH] SVE popcount support

Looks good, the code is more readable now.

For both Neon and SVE, I do see improvements with looping over 4
registers at a time, so IMHO it's worth doing so even if it performs the
same as 2-register blocks on some hardware.

There was no regression on Graviton 3 when using the 4-register version, so we can keep it.

-Chiranmoy

#29John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#27)
Re: [PATCH] SVE popcount support

On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

* 0002 introduces the Neon implementation, which conveniently doesn't need
configure-time checks or function pointers. I noticed that some
compilers (e.g., Apple clang 16) compile in Neon instructions already,
but our hand-rolled implementation is better about instruction-level
parallelism and seems to still be quite a bit faster.

+pg_popcount64(uint64 word)
+{
+ return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}

This confused me until I found that this is what
__builtin_popcountl(word) would emit anyway. Worth a comment?
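
As I read it, the one-liner loads the word as eight byte lanes, counts the set bits in each lane, and then adds the lanes together. A scalar model of those three steps, as a sketch only (the function names here are hypothetical and nothing below is Neon code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Count the set bits of one byte (the per-lane step). */
static uint8_t
popcount_byte(uint8_t b)
{
    uint8_t     n = 0;

    for (; b != 0; b &= (uint8_t) (b - 1))
        n++;
    return n;
}

/* Hypothetical scalar model of "load 8 lanes, count per lane, add across". */
static int
popcount64_lanes(uint64_t word)
{
    uint8_t     lanes[8];
    int         sum = 0;

    memcpy(lanes, &word, sizeof(lanes));    /* "load" eight byte lanes */
    for (int i = 0; i < 8; i++)
        sum += popcount_byte(lanes[i]);     /* per-lane count, then sum */

    assert(sum >= 0 && sum <= 64);
    return sum;
}
```

The byte order of the load doesn't matter here, since the per-lane counts are summed regardless of position.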

Some thoughts to consider, some speculative and maybe not worth
putting time into:

I did add a 2-register block
in the Neon implementation for processing the tail because I was worried
about its performance on smaller buffers, but that part might get removed
if I can't measure any difference.

Even if we can measure a difference on fixed-sized inputs, that might
not carry over when the branch is unpredictable.

+ /*
+ * Process remaining 8-byte blocks.
+ */
+ for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+ {
+ popcnt += pg_popcount64(*((uint64 *) buf));
+ buf += sizeof(uint64);
+ }

This uses 16-byte registers, but only loads 8-bytes at a time (with
accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
you could instead do a loop over a single local accumulator, which I
think could have a short accumulation pipeline since 3 iterations
can't overflow 8-bit lanes. But then the bytewise tail could be up to
15 bytes.
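
To make the overflow argument concrete: after the 4-register loop fewer than 64 bytes remain, so at most three full 16-byte chunks are left, and each 8-bit lane grows by at most 8 per chunk (3 * 8 = 24, far below 255). Here is a scalar model of that alternative tail, written as a sketch under those assumptions. The names are hypothetical, and a real version would use Neon intrinsics instead of the inner loops.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16                /* bytes in one 128-bit Neon register */

static uint8_t
popcount_byte(uint8_t b)
{
    uint8_t     n = 0;

    for (; b != 0; b &= (uint8_t) (b - 1))
        n++;
    return n;
}

/*
 * Hypothetical model: accumulate per-byte counts into 8-bit lanes for up to
 * three 16-byte chunks, reduce once, then finish the last (up to 15) bytes
 * one at a time.
 */
static uint64_t
popcount_tail_model(const unsigned char *buf, int bytes)
{
    uint8_t     accum[LANES] = {0};
    uint64_t    popcnt = 0;

    /* This is what the 4-register loop would leave behind. */
    assert(bytes >= 0 && bytes < 4 * LANES);

    for (; bytes >= LANES; bytes -= LANES)
    {
        for (int i = 0; i < LANES; i++)
            accum[i] += popcount_byte(buf[i]);  /* at most 24 per lane */
        buf += LANES;
    }

    for (int i = 0; i < LANES; i++)
        popcnt += accum[i];                     /* single widening reduction */

    for (; bytes > 0; bytes--)
        popcnt += popcount_byte(*buf++);        /* bytewise tail, up to 15 */

    return popcnt;
}
```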

* 0003 introduces the SVE implementation. You'll notice I've moved all the
function pointer gymnastics into the pg_popcount_aarch64.c file, which is
where the Neon implementations live, too. I also tried to clean up the
configure checks a bit. I imagine it's possible to make them more
compact, but I felt that the enhanced readability was worth it.

I don't know what the configure checks looked like before, but I'm
confused that the loops are unrolled in the link-test functions as
well.
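
For anyone new to the function pointer arrangement mentioned in the quoted text: the pointer starts out aimed at a "choose" function, which picks an implementation on the first call, repoints the pointer, and then forwards the request. A stripped-down sketch in portable C follows. The availability probe is a hypothetical stand-in for the HWCAP_SVE check, all names are made up, and choose_calls exists only to show that resolution happens once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static int  choose_calls = 0;   /* illustration only */

static uint64_t popcount_choose(const char *buf, int bytes);

/* Starts out pointing at the "choose" function. */
static uint64_t (*popcount_impl) (const char *buf, int bytes) = popcount_choose;

/* Hypothetical stand-in for the runtime hardware probe. */
static bool
fast_path_available(void)
{
    return false;
}

static uint64_t
popcount_portable(const char *buf, int bytes)
{
    uint64_t    popcnt = 0;

    assert(bytes >= 0);
    for (int i = 0; i < bytes; i++)
    {
        unsigned char b = (unsigned char) buf[i];

        for (; b != 0; b &= (unsigned char) (b - 1))
            popcnt++;
    }
    return popcnt;
}

/* Stand-in for an accelerated version; it just delegates in this sketch. */
static uint64_t
popcount_fast(const char *buf, int bytes)
{
    return popcount_portable(buf, bytes);
}

static uint64_t
popcount_choose(const char *buf, int bytes)
{
    choose_calls++;
    popcount_impl = fast_path_available() ? popcount_fast : popcount_portable;
    return popcount_impl(buf, bytes);
}
```

After the first call, later calls go straight to the selected implementation and never revisit the probe.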

* For both Neon and SVE, I do see improvements with looping over 4
registers at a time, so IMHO it's worth doing so even if it performs the
same as 2-register blocks on some hardware.

I wonder if alignment matters for these larger blocks.

--
John Naylor
Amazon Web Services

#30Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#29)
Re: [PATCH] SVE popcount support

On Mon, Mar 24, 2025 at 06:34:45PM +0700, John Naylor wrote:

On Sat, Mar 22, 2025 at 10:42 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

* 0002 introduces the Neon implementation, which conveniently doesn't need
configure-time checks or function pointers. I noticed that some
compilers (e.g., Apple clang 16) compile in Neon instructions already,
but our hand-rolled implementation is better about instruction-level
parallelism and seems to still be quite a bit faster.

+pg_popcount64(uint64 word)
+{
+ return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}

This confused me until I found that this is what
__builtin_popcountl(word) would emit anyway. Worth a comment?

Sure thing.

+ /*
+ * Process remaining 8-byte blocks.
+ */
+ for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+ {
+ popcnt += pg_popcount64(*((uint64 *) buf));
+ buf += sizeof(uint64);
+ }

This uses 16-byte registers, but only loads 8-bytes at a time (with
accumulation work), then a bytewise tail up to 7 bytes. Alternatively,
you could instead do a loop over a single local accumulator, which I
think could have a short accumulation pipeline since 3 iterations
can't overflow 8-bit lanes. But then the bytewise tail could be up to
15 bytes.

Yeah, I wasn't sure how far we wanted to go with this. We could do 4
registers at a time, then 2, then 1, then 8-bytes, then byte-by-byte, but
that's quite a few extra lines of code for the amount of gain, not to
mention the extra overhead. My inclination was to try to keep this as
simple as possible while making sure we didn't regress on small inputs.
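
To illustrate the trade-off, here is a small sketch that only counts how many steps each tier would take for a given length, with and without an extra single-register tier. It assumes 16-byte Neon registers and my reading of the patch layout (a 4-register loop, one 2-register step, an 8-byte loop, then a bytewise tail). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define REG_BYTES 16            /* one 128-bit Neon register */

/* Hypothetical record of how many steps each tier would execute. */
typedef struct TierPlan
{
    int         blocks4;        /* 4-register (64-byte) loop iterations */
    int         blocks2;        /* 2-register (32-byte) steps */
    int         blocks1;        /* 1-register (16-byte) steps */
    int         words8;         /* 8-byte loop iterations */
    int         tail_bytes;     /* bytes left for the byte-by-byte loop */
} TierPlan;

static TierPlan
plan_tiers(int bytes, bool with_single_register_tier)
{
    TierPlan    plan = {0, 0, 0, 0, 0};

    assert(bytes >= 0);

    plan.blocks4 = bytes / (4 * REG_BYTES);
    bytes %= 4 * REG_BYTES;

    if (bytes >= 2 * REG_BYTES)
    {
        plan.blocks2 = 1;
        bytes -= 2 * REG_BYTES;
    }

    /* The optional extra tier discussed above. */
    if (with_single_register_tier && bytes >= REG_BYTES)
    {
        plan.blocks1 = 1;
        bytes -= REG_BYTES;
    }

    plan.words8 = bytes / 8;
    plan.tail_bytes = bytes % 8;
    return plan;
}
```

For a 127-byte input, the extra tier replaces two of the three 8-byte iterations with one 16-byte step, which gives a sense of how little work it saves relative to the added branch.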

* 0003 introduces the SVE implementation. You'll notice I've moved all the
function pointer gymnastics into the pg_popcount_aarch64.c file, which is
where the Neon implementations live, too. I also tried to clean up the
configure checks a bit. I imagine it's possible to make them more
compact, but I felt that the enhanced readability was worth it.

I don't know what the configure checks looked like before, but I'm
confused that the loops are unrolled in the link-test functions as
well.

We do need the two separate blocks because they use different intrinsic
functions, but I could probably remove the actual "for" loops themselves to
simplify things a bit.

* For both Neon and SVE, I do see improvements with looping over 4
registers at a time, so IMHO it's worth doing so even if it performs the
same as 2-register blocks on some hardware.

I wonder if alignment matters for these larger blocks.

Some earlier benchmarks didn't show anything outside of the noise range [0].

[0]: /messages/by-id/OSBPR01MB266403FD4C05DAB58EBBA82897EF2@OSBPR01MB2664.jpnprd01.prod.outlook.com

--
nathan

#31Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#30)
3 attachment(s)
Re: [PATCH] SVE popcount support

I've attached a new set of patches in which I've tried to address John's
feedback. I ran some new benchmarks with these patches. "M3" is an Apple
M3 (my laptop), "G3" is an r7g.4xlarge, and "G4" is an r8g.4xlarge. "no
SVE" means the patches are applied but the function pointer points to the
Neon implementation. "SVE" and "patched" mean all the patches are applied
with no changes.

 8 byte words | M3 HEAD | M3 patched | G3 HEAD | G3 no SVE | G3 SVE  | G4 HEAD | G4 no SVE | G4 SVE
--------------+---------+------------+---------+-----------+---------+---------+-----------+---------
            1 |     3.6 |        3.0 |     3.1 |       2.9 |     3.1 |     2.5 |       2.2 |     1.8
            2 |     6.4 |        4.4 |     3.1 |       3.0 |     3.1 |     2.5 |       2.5 |     2.0
            3 |     7.3 |        6.9 |     3.5 |       3.5 |     3.1 |     3.3 |       3.2 |     2.0
            4 |     8.0 |        3.8 |     4.0 |       2.7 |     4.7 |     3.6 |       2.2 |     2.7
            5 |     9.4 |        5.5 |     4.6 |       2.8 |     4.6 |     3.9 |       2.5 |     2.7
            6 |     7.9 |        5.0 |     5.1 |       3.5 |     4.7 |     4.3 |       3.1 |     3.4
            7 |    10.2 |        7.4 |     5.9 |       4.0 |     4.7 |     4.7 |       3.6 |     3.4
            8 |    12.0 |        5.4 |     6.5 |       4.0 |     5.9 |     5.0 |       3.2 |     2.5
            9 |    11.7 |        6.5 |     7.2 |       4.3 |     5.9 |     5.4 |       3.6 |     2.5
           10 |    12.5 |        5.4 |     8.0 |       4.8 |     5.9 |     6.2 |       3.9 |     3.1
           11 |    14.0 |        8.6 |     8.5 |       5.5 |     5.9 |     6.1 |       5.0 |     3.1
           12 |    13.1 |        5.7 |     9.1 |       5.1 |     7.4 |     6.4 |       3.9 |     3.6
           13 |    12.1 |        6.8 |     9.8 |       5.4 |     7.3 |     6.8 |       4.3 |     3.6
           14 |    16.4 |        7.8 |    10.4 |       5.9 |     7.4 |     7.2 |       4.7 |     4.4
           15 |    17.4 |        8.0 |    11.1 |       6.6 |     7.4 |     7.5 |       5.7 |     4.4
           16 |    15.5 |        5.7 |    11.8 |       5.7 |     4.7 |     7.9 |       5.0 |     3.5
           32 |    26.0 |       16.2 |    22.7 |      10.3 |     6.2 |    16.8 |       8.4 |     5.2
           64 |    38.5 |       20.3 |    42.7 |      20.1 |     9.3 |    31.8 |      15.4 |     8.8
          128 |    75.1 |       35.7 |    86.1 |      35.0 |    15.4 |    80.2 |      28.6 |    16.3
          256 |   117.7 |       51.8 |   179.6 |      68.2 |    27.8 |   154.0 |      55.7 |    30.9
          512 |   198.5 |       93.1 |   329.3 |     134.4 |    52.4 |   246.5 |     110.2 |    59.4
         1024 |   355.0 |      159.2 |   673.6 |     265.8 |   101.7 |   487.0 |     219.0 |   114.7
         2048 |   669.5 |      288.8 |  1294.7 |     529.7 |   200.3 |   969.3 |     438.7 |   228.5
         4096 |  1308.0 |      552.8 |  2784.3 |    1063.0 |   397.4 |  1934.5 |     874.4 |   455.9

IMHO these are acceptable results, at least for the use-cases I see in the
tree. We might be able to minimize the difference between the Neon and SVE
implementations on the low end with some additional code, but I'm really
not sure if it's worth the effort.
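
For a quick way to read the table, the ratio of two columns in the same row gives a unit-free speedup. This assumes the figures are timings where lower is better, which the message doesn't state explicitly. The helper below is a hypothetical convenience and not part of any patch.

```c
#include <assert.h>

/* Ratio > 1 means the candidate column is faster than the baseline column. */
static double
speedup(double baseline, double candidate)
{
    assert(candidate > 0.0);
    return baseline / candidate;
}
```

Using the 4096-word row, speedup(2784.3, 397.4) comes out at roughly 7x for G3 HEAD versus G3 SVE. On the low end, the 4-word row gives speedup(2.7, 4.7), which is below 1, reflecting that the SVE path trails the Neon path for very small inputs on G3.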

Barring feedback or objections, I'm planning to commit these on Friday.

--
nathan

Attachments:

v9-0001-Rename-TRY_POPCNT_FAST-to-TRY_POPCNT_X86_64.patch (text/plain; charset=us-ascii)
From 1a8d7b9552efa3bbbbde23be4b18b8031520150a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Mar 2025 19:48:41 -0500
Subject: [PATCH v9 1/3] Rename TRY_POPCNT_FAST to TRY_POPCNT_X86_64.

This macro guards x86_64-specific code, and a follow-up commit will
add AArch64-specific versions of that code.  To avoid confusion,
let's rename TRY_POPCNT_FAST to make it more obvious that it's for
x86_64.

Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
---
 src/include/port/pg_bitutils.h |  6 +++---
 src/port/pg_bitutils.c         | 14 +++++++-------
 src/port/pg_popcount_avx512.c  |  8 ++++----
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 62554ce685a..3067ff402ba 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -294,11 +294,11 @@ pg_ceil_log2_64(uint64 num)
  */
 #ifdef HAVE_X86_64_POPCNTQ
 #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
-#define TRY_POPCNT_FAST 1
+#define TRY_POPCNT_X86_64 1
 #endif
 #endif
 
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
@@ -322,7 +322,7 @@ extern int	pg_popcount64(uint64 word);
 extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 
 /*
  * Returns the number of 1-bits in buf.
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693d..82be40e2fb4 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -108,7 +108,7 @@ static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
@@ -123,9 +123,9 @@ int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 
 /*
  * Return true if CPUID indicates that the POPCNT instruction is available.
@@ -337,7 +337,7 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 
 
 /*
@@ -486,13 +486,13 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef TRY_POPCNT_FAST
+#ifndef TRY_POPCNT_X86_64
 
 /*
  * When the POPCNT instruction is not available, there's no point in using
  * function pointers to vary the implementation between the fast and slow
  * method.  We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined.  The compiler should be able to inline
+ * TRY_POPCNT_X86_64 is not defined.  The compiler should be able to inline
  * the slow versions here.
  */
 int
@@ -527,4 +527,4 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
 
-#endif							/* !TRY_POPCNT_FAST */
+#endif							/* !TRY_POPCNT_X86_64 */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dac895a0fc2..80c0aee3e73 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -27,11 +27,11 @@
 #include "port/pg_bitutils.h"
 
 /*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * It's probably unlikely that TRY_POPCNT_X86_64 won't be set if we are able to
  * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ * the function pointers that are only used when TRY_POPCNT_X86_64 is set.
  */
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 
 /*
  * Does CPUID say there's support for XSAVE instructions?
@@ -219,5 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 	return _mm512_reduce_add_epi64(accum);
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 #endif							/* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.39.5 (Apple Git-154)

v9-0002-Add-Neon-popcount-support.patch (text/plain; charset=us-ascii)
From 5953da8e6c4d167954cbedfca58bd7558feb8620 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Mar 2025 20:10:23 -0500
Subject: [PATCH v9 2/3] Add Neon popcount support.

This commit introduces a Neon implementation of pg_popcount{32,64},
pg_popcount(), and pg_popcount_masked().  As in simd.h, we assume
that all available AArch64 hardware supports Neon, so we
conveniently don't need any new configure-time or runtime checks.
Some compilers emit Neon instructions for these functions already,
but our hand-rolled implementations for pg_popcount() and
pg_popcount_masked() performed better in our tests, presumably due
to the instruction-level parallelism.

Author: "Chiranmoy.Bhattacharya@fujitsu.com" <Chiranmoy.Bhattacharya@fujitsu.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
---
 src/include/port/pg_bitutils.h |   9 ++
 src/port/Makefile              |   1 +
 src/port/meson.build           |   1 +
 src/port/pg_bitutils.c         |  22 +++-
 src/port/pg_popcount_aarch64.c | 208 +++++++++++++++++++++++++++++++++
 5 files changed, 235 insertions(+), 6 deletions(-)
 create mode 100644 src/port/pg_popcount_aarch64.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 3067ff402ba..a387f77c2c0 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,15 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On AArch64, we can use Neon instructions if the compiler provides access to
+ * them (as indicated by __ARM_NEON).  As in simd.h, we assume that all
+ * available 64-bit hardware has Neon support.
+ */
+#if defined(__aarch64__) && defined(__ARM_NEON)
+#define POPCNT_AARCH64 1
+#endif
+
 #ifdef TRY_POPCNT_X86_64
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..cb86b7141e6 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_popcount_aarch64.o \
 	pg_popcount_avx512.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..cad0dd8f4f8 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_aarch64.c',
   'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 82be40e2fb4..61c7388f474 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,10 +103,15 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
+/*
+ * If we are building the Neon versions, we don't need the "slow" fallbacks.
+ */
+#ifndef POPCNT_AARCH64
 static inline int pg_popcount32_slow(uint32 word);
 static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
+#endif
 
 #ifdef TRY_POPCNT_X86_64
 static bool pg_popcount_available(void);
@@ -339,6 +344,10 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 
 #endif							/* TRY_POPCNT_X86_64 */
 
+/*
+ * If we are building the Neon versions, we don't need the "slow" fallbacks.
+ */
+#ifndef POPCNT_AARCH64
 
 /*
  * pg_popcount32_slow
@@ -486,14 +495,15 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef TRY_POPCNT_X86_64
+#endif							/* ! POPCNT_AARCH64 */
+
+#if !defined(TRY_POPCNT_X86_64) && !defined(POPCNT_AARCH64)
 
 /*
- * When the POPCNT instruction is not available, there's no point in using
+ * When special CPU instructions are not available, there's no point in using
  * function pointers to vary the implementation between the fast and slow
- * method.  We instead just make these actual external functions when
- * TRY_POPCNT_X86_64 is not defined.  The compiler should be able to inline
- * the slow versions here.
+ * method.  We instead just make these actual external functions.  The compiler
+ * should be able to inline the slow versions here.
  */
 int
 pg_popcount32(uint32 word)
@@ -527,4 +537,4 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
 
-#endif							/* !TRY_POPCNT_X86_64 */
+#endif							/* ! TRY_POPCNT_X86_64 && ! POPCNT_AARCH64 */
diff --git a/src/port/pg_popcount_aarch64.c b/src/port/pg_popcount_aarch64.c
new file mode 100644
index 00000000000..cdcfee464e4
--- /dev/null
+++ b/src/port/pg_popcount_aarch64.c
@@ -0,0 +1,208 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_aarch64.c
+ *	  Holds the AArch64 pg_popcount() implementations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_aarch64.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+#ifdef POPCNT_AARCH64
+
+#include <arm_neon.h>
+
+/*
+ * pg_popcount32
+ *		Return number of 1 bits in word
+ */
+int
+pg_popcount32(uint32 word)
+{
+	return pg_popcount64((uint64) word);
+}
+
+/*
+ * pg_popcount64
+ *		Return number of 1 bits in word
+ */
+int
+pg_popcount64(uint64 word)
+{
+	/*
+	 * For some compilers, __builtin_popcountl() emits Neon instructions
+	 * already. The line below should compile to the same code on those
+	 * systems.
+	 */
+	return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}
+
+/*
+ * pg_popcount_optimized
+ *		Returns number of 1 bits in buf
+ */
+uint64
+pg_popcount_optimized(const char *buf, int bytes)
+{
+	uint8x16_t	vec;
+	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0),
+				accum3 = vdupq_n_u64(0),
+				accum4 = vdupq_n_u64(0);
+	uint64		popcnt = 0;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		vec = vld1q_u8((const uint8 *) buf);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum3 = vpadalq_u32(accum3, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum4 = vpadalq_u32(accum4, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * sizeof(uint8x16_t);
+	if (bytes >= bytes_per_iteration)
+	{
+		vec = vld1q_u8((const uint8 *) buf);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+	popcnt += vaddvq_u64(vaddq_u64(accum3, accum4));
+
+	/*
+	 * Process remaining 8-byte blocks.
+	 */
+	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+	{
+		popcnt += pg_popcount64(*((uint64 *) buf));
+		buf += sizeof(uint64);
+	}
+
+	/*
+	 * Process any remaining data byte-by-byte.
+	 */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_optimized
+ *		Returns number of 1 bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+	uint8x16_t	vec;
+	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0),
+				accum3 = vdupq_n_u64(0),
+				accum4 = vdupq_n_u64(0);
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+	uint8x16_t	maskv = vdupq_n_u8(mask);
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum3 = vpadalq_u32(accum3, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum4 = vpadalq_u32(accum4, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * sizeof(uint8x16_t);
+	if (bytes >= bytes_per_iteration)
+	{
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+	popcnt += vaddvq_u64(vaddq_u64(accum3, accum4));
+
+	/*
+	 * Process remaining 8-byte blocks.
+	 */
+	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+	{
+		popcnt += pg_popcount64(*((uint64 *) buf) & mask64);
+		buf += sizeof(uint64);
+	}
+
+	/*
+	 * Process any remaining data byte-by-byte.
+	 */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+	return popcnt;
+}
+
+#endif							/* POPCNT_AARCH64 */
-- 
2.39.5 (Apple Git-154)

v9-0003-Add-SVE-popcount-support.patch (text/plain; charset=us-ascii)
From 1b2c3a8101fb7a3844de4594141492f72981af12 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Mar 2025 20:30:22 -0500
Subject: [PATCH v9 3/3] Add SVE popcount support.

This commit introduces SVE implementations of pg_popcount() and pg_popcount_masked().
Unlike Neon support, we need an additional configure-time check to
discover whether the compiler supports SVE intrinsics, and we need
a runtime check to find whether the current CPU supports SVE
instructions.  While this commit introduces a new function pointer
so that the implementation can be chosen at runtime, the
AArch64-specific implementations are fast enough to avoid any
measurable regressions as compared to previous versions of
PostgreSQL.  The SVE implementations are much faster for larger
inputs, including the uses for the visibility map.

Author: "Chiranmoy.Bhattacharya@fujitsu.com" <Chiranmoy.Bhattacharya@fujitsu.com>
Co-authored-by: "Malladi, Rama" <ramamalladi@hotmail.com>
Co-authored-by: "Devanga.Susmitha@fujitsu.com" <Devanga.Susmitha@fujitsu.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
Discussion: https://postgr.es/m/OSZPR01MB84990A9A02A3515C6E85A65B8B2A2%40OSZPR01MB8499.jpnprd01.prod.outlook.com
---
 config/c-compiler.m4           |  53 ++++++++
 configure                      |  73 ++++++++++
 configure.ac                   |   9 ++
 meson.build                    |  50 +++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  17 +++
 src/port/pg_popcount_aarch64.c | 235 ++++++++++++++++++++++++++++++++-
 7 files changed, 434 insertions(+), 6 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3712e81e38c..8490354a1e0 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -708,3 +708,56 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_SVE_POPCNT_INTRINSICS
+# --------------------------
+# Check if the compiler supports the SVE popcount instructions using the
+# svptrue_b64, svdup_u64, svcntb, svld1, svadd_x, svcnt_x, svaddv,
+# svwhilelt_b8, and svand_x intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sve_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for svcnt_x], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <arm_sve.h>
+
+	char buf[128];
+
+	#if defined(__has_attribute) && __has_attribute (target)
+	__attribute__((target("arch=armv8-a+sve")))
+	#endif
+	static int popcount_test(void)
+	{
+		uint32_t	vec_len = svcntb();
+		int			bytes = sizeof(buf);
+		svuint64_t	accum1 = svdup_u64(0),
+					accum2 = svdup_u64(0),
+					vec64;
+		svuint8_t	vec8;
+		svbool_t	pred = svptrue_b64();
+		uint64_t	popcnt = 0,
+					mask = 0x5555555555555555;
+		char	   *p = buf;
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		p += vec_len;
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+		p += vec_len;
+
+		popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+		pred = svwhilelt_b8(0, bytes);
+		vec8 = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+		return (int) (popcnt + svaddv(pred, svcnt_x(pred, vec8)));
+	}]],
+  [return popcount_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index fac1e9a4e39..2e291f97c99 100755
--- a/configure
+++ b/configure
@@ -17378,6 +17378,79 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_x" >&5
+$as_echo_n "checking for svcnt_x... " >&6; }
+if ${pgac_cv_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+	char buf[128];
+
+	#if defined(__has_attribute) && __has_attribute (target)
+	__attribute__((target("arch=armv8-a+sve")))
+	#endif
+	static int popcount_test(void)
+	{
+		uint32_t	vec_len = svcntb();
+		int			bytes = sizeof(buf);
+		svuint64_t	accum1 = svdup_u64(0),
+					accum2 = svdup_u64(0),
+					vec64;
+		svuint8_t	vec8;
+		svbool_t	pred = svptrue_b64();
+		uint64_t	popcnt = 0,
+					mask = 0x5555555555555555;
+		char	   *p = buf;
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		p += vec_len;
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+		p += vec_len;
+
+		popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+		pred = svwhilelt_b8(0, bytes);
+		vec8 = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+		return (int) (popcnt + svaddv(pred, svcnt_x(pred, vec8)));
+	}
+int
+main ()
+{
+return popcount_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_sve_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index b6d02f5ecc7..64b52940658 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2057,6 +2057,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 7cf518a2765..f8f1dce6bc9 100644
--- a/meson.build
+++ b/meson.build
@@ -2285,6 +2285,56 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+char buf[128];
+
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+	uint32_t	vec_len = svcntb();
+	int			bytes = sizeof(buf);
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				vec64;
+	svuint8_t	vec8;
+	svbool_t	pred = svptrue_b64();
+	uint64_t	popcnt = 0,
+				mask = 0x5555555555555555;
+	char	   *p = buf;
+
+	vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+	accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+	p += vec_len;
+
+	vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+	accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	p += vec_len;
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	pred = svwhilelt_b8(0, bytes);
+	vec8 = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+	return (int) (popcnt + svaddv(pred, svcnt_x(pred, vec8)));
+}
+'''
+
+  if cc.links(prog, name: 'SVE popcount', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index db6454090d2..2a67db077a9 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -706,6 +706,9 @@
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to build with systemd support. (--with-systemd) */
 #undef USE_SYSTEMD
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a387f77c2c0..c7901bf8ddc 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -324,6 +324,23 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#elif POPCNT_AARCH64
+/* Use the Neon version of pg_popcount{32,64} without function pointer. */
+extern int	pg_popcount32(uint32 word);
+extern int	pg_popcount64(uint64 word);
+
+/*
+ * We can try to use an SVE-optimized pg_popcount() on some systems.  For that,
+ * we do use a function pointer.
+ */
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
+#else
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/pg_popcount_aarch64.c b/src/port/pg_popcount_aarch64.c
index cdcfee464e4..2b7a2f97b83 100644
--- a/src/port/pg_popcount_aarch64.c
+++ b/src/port/pg_popcount_aarch64.c
@@ -18,6 +18,229 @@
 
 #include <arm_neon.h>
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+#endif
+
+/*
+ * The Neon versions are built regardless of whether we are building the SVE
+ * versions.
+ */
+static uint64 pg_popcount_neon(const char *buf, int bytes);
+static uint64 pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask);
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+/*
+ * These are the SVE implementations of the popcount functions.
+ */
+static uint64 pg_popcount_sve(const char *buf, int bytes);
+static uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+
+/*
+ * The function pointers are initially set to "choose" functions.  These
+ * functions will first set the pointers to the right implementations (based on
+ * what the current CPU supports) and then will call the pointer to fulfill the
+ * caller's request.
+ */
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
+
+static inline bool
+pg_popcount_sve_available(void)
+{
+#ifdef HAVE_ELF_AUX_INFO
+	unsigned long value;
+
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL)
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_popcount_functions(void)
+{
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
+	else
+	{
+		pg_popcount_optimized = pg_popcount_neon;
+		pg_popcount_masked_optimized = pg_popcount_masked_neon;
+	}
+}
+
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	choose_popcount_functions();
+	return pg_popcount_optimized(buf, bytes);
+}
+
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+	choose_popcount_functions();
+	return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns number of 1 bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	uint32		vec_len = svcntb(),
+				bytes_per_iteration = 4 * vec_len;
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				accum3 = svdup_u64(0),
+				accum4 = svdup_u64(0);
+	svbool_t	pred = svptrue_b64();
+	uint64		popcnt = 0;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+	popcnt += svaddv(pred, svadd_x(pred, accum3, accum4));
+
+	/*
+	 * Process any remaining data.
+	 */
+	for (; bytes >= vec_len; bytes -= vec_len)
+	{
+		svuint8_t	vec;
+
+		pred = svwhilelt_b8(0, bytes);
+		vec = svld1(pred, (const uint8 *) buf);
+		popcnt += svaddv(pred, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns number of 1 bits in buf after applying the mask to each byte
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	uint32		vec_len = svcntb(),
+				bytes_per_iteration = 4 * vec_len;
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				accum3 = svdup_u64(0),
+				accum4 = svdup_u64(0);
+	svbool_t	pred = svptrue_b64();
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+	popcnt += svaddv(pred, svadd_x(pred, accum3, accum4));
+
+	/*
+	 * Process any remaining data.
+	 */
+	for (; bytes >= vec_len; bytes -= vec_len)
+	{
+		svuint8_t	vec;
+
+		pred = svwhilelt_b8(0, bytes);
+		vec = svand_x(pred, svld1(pred, (const uint8 *) buf), mask);
+		popcnt += svaddv(pred, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	return popcnt;
+}
+
+#else							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
+
+/*
+ * When the SVE version isn't available, there's no point in using function
+ * pointers to vary the implementation.  We instead just make these actual
+ * external functions when USE_SVE_POPCNT_WITH_RUNTIME_CHECK is not defined.
+ * The compiler should be able to inline the slow versions here.
+ */
+uint64
+pg_popcount_optimized(const char *buf, int bytes)
+{
+	return pg_popcount_neon(buf, bytes);
+}
+
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+	return pg_popcount_masked_neon(buf, bytes, mask);
+}
+
+#endif							/* ! USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
+
 /*
  * pg_popcount32
  *		Return number of 1 bits in word
@@ -44,11 +267,11 @@ pg_popcount64(uint64 word)
 }
 
 /*
- * pg_popcount_optimized
+ * pg_popcount_neon
  *		Returns number of 1 bits in buf
  */
-uint64
-pg_popcount_optimized(const char *buf, int bytes)
+static uint64
+pg_popcount_neon(const char *buf, int bytes)
 {
 	uint8x16_t	vec;
 	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
@@ -124,11 +347,11 @@ pg_popcount_optimized(const char *buf, int bytes)
 }
 
 /*
- * pg_popcount_masked_optimized
+ * pg_popcount_masked_neon
  *		Returns number of 1 bits in buf after applying the mask to each byte
  */
-uint64
-pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+static uint64
+pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask)
 {
 	uint8x16_t	vec;
 	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
-- 
2.39.5 (Apple Git-154)

#32Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#31)
3 attachment(s)
Re: [PATCH] SVE popcount support

On Wed, Mar 26, 2025 at 04:44:24PM -0500, Nathan Bossart wrote:

IMHO these are acceptable results, at least for the use-cases I see in the
tree. We might be able to minimize the difference between the Neon and SVE
implementations on the low end with some additional code, but I'm really
not sure if it's worth the effort.

I couldn't resist... I tried a variety of things (e.g., inlining the Neon
implementation to process the tail, jumping to the Neon implementation for
smaller inputs), and the only thing that seemed to be a clear win was to
add a 2-register block in the SVE implementations (like what is already
there for the Neon ones). In particular, that helps bring the Graviton3
SVE numbers closer to the Neon numbers for inputs of 8 to 16 8-byte
words (64 to 128 bytes).
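
To make the shape of that change concrete, here is a minimal portable-C
sketch of the loop structure: a 4-register main loop, then at most one
2-register block, then the tail. It is not the SVE code from the patch.
VEC_LEN, count_vec() and popcount_blocked() are hypothetical names, and a
scalar bit count stands in for the svcnt_x/svadd_x intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one vector register; real SVE asks svcntb(). */
#define VEC_LEN 16

/* Count the 1 bits in len bytes (scalar stand-in for a vector popcount). */
static uint64_t
count_vec(const unsigned char *p, int len)
{
	uint64_t	n = 0;

	for (int i = 0; i < len; i++)
		for (unsigned char b = p[i]; b != 0; b &= (unsigned char) (b - 1))
			n++;				/* each pass clears the lowest set bit */
	return n;
}

static uint64_t
popcount_blocked(const char *buf, int bytes)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	accum1 = 0,
				accum2 = 0,
				accum3 = 0,
				accum4 = 0;

	/* Main loop: four independent accumulators per iteration. */
	for (; bytes >= 4 * VEC_LEN; bytes -= 4 * VEC_LEN)
	{
		accum1 += count_vec(p, VEC_LEN);
		p += VEC_LEN;
		accum2 += count_vec(p, VEC_LEN);
		p += VEC_LEN;
		accum3 += count_vec(p, VEC_LEN);
		p += VEC_LEN;
		accum4 += count_vec(p, VEC_LEN);
		p += VEC_LEN;
	}

	/* The extra step: at most one more block of two registers. */
	if (bytes >= 2 * VEC_LEN)
	{
		accum1 += count_vec(p, VEC_LEN);
		p += VEC_LEN;
		accum2 += count_vec(p, VEC_LEN);
		p += VEC_LEN;
		bytes -= 2 * VEC_LEN;
	}

	/* Fewer than 2 * VEC_LEN bytes are left for the tail. */
	return accum1 + accum2 + accum3 + accum4 + count_vec(p, bytes);
}
```

The point of the middle step is that an input just short of a full
4-register block still gets two whole vectors handled outside the slower
tail path. With a 16-byte register, 2 * VEC_LEN is 32 bytes; how that maps
onto the 8-16 word range mentioned above depends on the hardware vector
length, which this sketch does not model.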

I also noticed a silly mistake in 0003 that would cause us to potentially
skip part of the tail. That should be fixed now.
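
The message doesn't spell the mistake out. Going by the v9 tail loop,
which reads "for (; bytes >= vec_len; bytes -= vec_len)" around a
svwhilelt_b8(0, bytes) predicate, one plausible reading is that a final
partial vector was never visited. The sketch below models that in portable
C. VEC_LEN, count_active(), tail_v9_style() and tail_fixed() are
hypothetical names, and the "bytes > 0" condition is an assumed fix, not
quoted from v10.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vector length in bytes; real SVE asks svcntb() at runtime. */
#define VEC_LEN 16

/*
 * Count bits in the first min(bytes, VEC_LEN) bytes at p.  This mimics a
 * load predicated by svwhilelt_b8(0, bytes): lanes past "bytes" are inactive.
 */
static uint64_t
count_active(const unsigned char *p, int bytes)
{
	int			active = bytes < VEC_LEN ? bytes : VEC_LEN;
	uint64_t	n = 0;

	for (int i = 0; i < active; i++)
		for (unsigned char b = p[i]; b != 0; b &= (unsigned char) (b - 1))
			n++;				/* each pass clears the lowest set bit */
	return n;
}

/* Loop condition as in the v9 patch: stops once a partial vector remains. */
static uint64_t
tail_v9_style(const char *buf, int bytes)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	popcnt = 0;
	int			off = 0;

	for (; bytes >= VEC_LEN; bytes -= VEC_LEN)
	{
		popcnt += count_active(p + off, bytes);
		off += VEC_LEN;
	}
	return popcnt;
}

/* Assumed fix: keep going while any byte remains; the predicate trims it. */
static uint64_t
tail_fixed(const char *buf, int bytes)
{
	const unsigned char *p = (const unsigned char *) buf;
	uint64_t	popcnt = 0;
	int			off = 0;

	for (; bytes > 0; bytes -= VEC_LEN)
	{
		popcnt += count_active(p + off, bytes);
		off += VEC_LEN;
	}
	return popcnt;
}
```

With a 20-byte buffer of 0xFF bytes, tail_v9_style() counts only the first
16 bytes (128 bits), while tail_fixed() counts all 160 bits. The two agree
whenever the length is an exact multiple of VEC_LEN.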

--
nathan

Attachments:

v10-0001-Rename-TRY_POPCNT_FAST-to-TRY_POPCNT_X86_64.patch (text/plain; charset=us-ascii)
From e938de4a8f1bf1b6b1aec05ec9d753621e37746f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Mar 2025 19:48:41 -0500
Subject: [PATCH v10 1/3] Rename TRY_POPCNT_FAST to TRY_POPCNT_X86_64.

This macro guards x86_64-specific code, and a follow-up commit will
add AArch64-specific versions of that code.  To avoid confusion,
let's rename TRY_POPCNT_FAST to make it more obvious that it's for
x86_64.

Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
---
 src/include/port/pg_bitutils.h |  6 +++---
 src/port/pg_bitutils.c         | 14 +++++++-------
 src/port/pg_popcount_avx512.c  |  8 ++++----
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 62554ce685a..3067ff402ba 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -294,11 +294,11 @@ pg_ceil_log2_64(uint64 num)
  */
 #ifdef HAVE_X86_64_POPCNTQ
 #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
-#define TRY_POPCNT_FAST 1
+#define TRY_POPCNT_X86_64 1
 #endif
 #endif
 
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
@@ -322,7 +322,7 @@ extern int	pg_popcount64(uint64 word);
 extern uint64 pg_popcount_optimized(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 
 /*
  * Returns the number of 1-bits in buf.
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 5677525693d..82be40e2fb4 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -108,7 +108,7 @@ static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
 
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
@@ -123,9 +123,9 @@ int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
 uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
 uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 
 /*
  * Return true if CPUID indicates that the POPCNT instruction is available.
@@ -337,7 +337,7 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 
 
 /*
@@ -486,13 +486,13 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef TRY_POPCNT_FAST
+#ifndef TRY_POPCNT_X86_64
 
 /*
  * When the POPCNT instruction is not available, there's no point in using
  * function pointers to vary the implementation between the fast and slow
  * method.  We instead just make these actual external functions when
- * TRY_POPCNT_FAST is not defined.  The compiler should be able to inline
+ * TRY_POPCNT_X86_64 is not defined.  The compiler should be able to inline
  * the slow versions here.
  */
 int
@@ -527,4 +527,4 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
 
-#endif							/* !TRY_POPCNT_FAST */
+#endif							/* !TRY_POPCNT_X86_64 */
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dac895a0fc2..80c0aee3e73 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -27,11 +27,11 @@
 #include "port/pg_bitutils.h"
 
 /*
- * It's probably unlikely that TRY_POPCNT_FAST won't be set if we are able to
+ * It's probably unlikely that TRY_POPCNT_X86_64 won't be set if we are able to
  * use AVX-512 intrinsics, but we check it anyway to be sure.  We piggy-back on
- * the function pointers that are only used when TRY_POPCNT_FAST is set.
+ * the function pointers that are only used when TRY_POPCNT_X86_64 is set.
  */
-#ifdef TRY_POPCNT_FAST
+#ifdef TRY_POPCNT_X86_64
 
 /*
  * Does CPUID say there's support for XSAVE instructions?
@@ -219,5 +219,5 @@ pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask)
 	return _mm512_reduce_add_epi64(accum);
 }
 
-#endif							/* TRY_POPCNT_FAST */
+#endif							/* TRY_POPCNT_X86_64 */
 #endif							/* USE_AVX512_POPCNT_WITH_RUNTIME_CHECK */
-- 
2.39.5 (Apple Git-154)

v10-0002-Add-Neon-popcount-support.patch (text/plain; charset=us-ascii)
From ee81eded16a5b7987b0fdf180f6a411bef2810b6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Mon, 24 Mar 2025 20:10:23 -0500
Subject: [PATCH v10 2/3] Add Neon popcount support.

This commit introduces a Neon implementation of pg_popcount{32,64},
pg_popcount(), and pg_popcount_masked().  As in simd.h, we assume
that all available AArch64 hardware supports Neon, so we
conveniently don't need any new configure-time or runtime checks.
Some compilers emit Neon instructions for these functions already,
but our hand-rolled implementations for pg_popcount() and
pg_popcount_masked() performed better in our tests, presumably due
to the instruction-level parallelism.

Author: "Chiranmoy.Bhattacharya@fujitsu.com" <Chiranmoy.Bhattacharya@fujitsu.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
---
 src/include/port/pg_bitutils.h |   9 ++
 src/port/Makefile              |   1 +
 src/port/meson.build           |   1 +
 src/port/pg_bitutils.c         |  22 +++-
 src/port/pg_popcount_aarch64.c | 208 +++++++++++++++++++++++++++++++++
 5 files changed, 235 insertions(+), 6 deletions(-)
 create mode 100644 src/port/pg_popcount_aarch64.c

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 3067ff402ba..a387f77c2c0 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -298,6 +298,15 @@ pg_ceil_log2_64(uint64 num)
 #endif
 #endif
 
+/*
+ * On AArch64, we can use Neon instructions if the compiler provides access to
+ * them (as indicated by __ARM_NEON).  As in simd.h, we assume that all
+ * available 64-bit hardware has Neon support.
+ */
+#if defined(__aarch64__) && defined(__ARM_NEON)
+#define POPCNT_AARCH64 1
+#endif
+
 #ifdef TRY_POPCNT_X86_64
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
diff --git a/src/port/Makefile b/src/port/Makefile
index 4c224319512..cb86b7141e6 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -44,6 +44,7 @@ OBJS = \
 	noblock.o \
 	path.o \
 	pg_bitutils.o \
+	pg_popcount_aarch64.o \
 	pg_popcount_avx512.o \
 	pg_strong_random.o \
 	pgcheckdir.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 7fcfa728d43..cad0dd8f4f8 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -7,6 +7,7 @@ pgport_sources = [
   'noblock.c',
   'path.c',
   'pg_bitutils.c',
+  'pg_popcount_aarch64.c',
   'pg_popcount_avx512.c',
   'pg_strong_random.c',
   'pgcheckdir.c',
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 82be40e2fb4..61c7388f474 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,10 +103,15 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
+/*
+ * If we are building the Neon versions, we don't need the "slow" fallbacks.
+ */
+#ifndef POPCNT_AARCH64
 static inline int pg_popcount32_slow(uint32 word);
 static inline int pg_popcount64_slow(uint64 word);
 static uint64 pg_popcount_slow(const char *buf, int bytes);
 static uint64 pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask);
+#endif
 
 #ifdef TRY_POPCNT_X86_64
 static bool pg_popcount_available(void);
@@ -339,6 +344,10 @@ pg_popcount_masked_fast(const char *buf, int bytes, bits8 mask)
 
 #endif							/* TRY_POPCNT_X86_64 */
 
+/*
+ * If we are building the Neon versions, we don't need the "slow" fallbacks.
+ */
+#ifndef POPCNT_AARCH64
 
 /*
  * pg_popcount32_slow
@@ -486,14 +495,15 @@ pg_popcount_masked_slow(const char *buf, int bytes, bits8 mask)
 	return popcnt;
 }
 
-#ifndef TRY_POPCNT_X86_64
+#endif							/* ! POPCNT_AARCH64 */
+
+#if !defined(TRY_POPCNT_X86_64) && !defined(POPCNT_AARCH64)
 
 /*
- * When the POPCNT instruction is not available, there's no point in using
+ * When special CPU instructions are not available, there's no point in using
  * function pointers to vary the implementation between the fast and slow
- * method.  We instead just make these actual external functions when
- * TRY_POPCNT_X86_64 is not defined.  The compiler should be able to inline
- * the slow versions here.
+ * method.  We instead just make these actual external functions.  The compiler
+ * should be able to inline the slow versions here.
  */
 int
 pg_popcount32(uint32 word)
@@ -527,4 +537,4 @@ pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
 	return pg_popcount_masked_slow(buf, bytes, mask);
 }
 
-#endif							/* !TRY_POPCNT_X86_64 */
+#endif							/* ! TRY_POPCNT_X86_64 && ! POPCNT_AARCH64 */
diff --git a/src/port/pg_popcount_aarch64.c b/src/port/pg_popcount_aarch64.c
new file mode 100644
index 00000000000..cdcfee464e4
--- /dev/null
+++ b/src/port/pg_popcount_aarch64.c
@@ -0,0 +1,208 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_popcount_aarch64.c
+ *	  Holds the AArch64 pg_popcount() implementations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_popcount_aarch64.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "port/pg_bitutils.h"
+
+#ifdef POPCNT_AARCH64
+
+#include <arm_neon.h>
+
+/*
+ * pg_popcount32
+ *		Return number of 1 bits in word
+ */
+int
+pg_popcount32(uint32 word)
+{
+	return pg_popcount64((uint64) word);
+}
+
+/*
+ * pg_popcount64
+ *		Return number of 1 bits in word
+ */
+int
+pg_popcount64(uint64 word)
+{
+	/*
+	 * For some compilers, __builtin_popcountl() emits Neon instructions
+	 * already. The line below should compile to the same code on those
+	 * systems.
+	 */
+	return vaddv_u8(vcnt_u8(vld1_u8((const uint8 *) &word)));
+}
+
+/*
+ * pg_popcount_optimized
+ *		Returns number of 1 bits in buf
+ */
+uint64
+pg_popcount_optimized(const char *buf, int bytes)
+{
+	uint8x16_t	vec;
+	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0),
+				accum3 = vdupq_n_u64(0),
+				accum4 = vdupq_n_u64(0);
+	uint64		popcnt = 0;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		vec = vld1q_u8((const uint8 *) buf);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum3 = vpadalq_u32(accum3, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum4 = vpadalq_u32(accum4, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * sizeof(uint8x16_t);
+	if (bytes >= bytes_per_iteration)
+	{
+		vec = vld1q_u8((const uint8 *) buf);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vld1q_u8((const uint8 *) buf);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+	popcnt += vaddvq_u64(vaddq_u64(accum3, accum4));
+
+	/*
+	 * Process remaining 8-byte blocks.
+	 */
+	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+	{
+		popcnt += pg_popcount64(*((uint64 *) buf));
+		buf += sizeof(uint64);
+	}
+
+	/*
+	 * Process any remaining data byte-by-byte.
+	 */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++];
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_optimized
+ *		Returns number of 1 bits in buf after applying the mask to each byte
+ */
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+	uint8x16_t	vec;
+	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
+	uint64x2_t	accum1 = vdupq_n_u64(0),
+				accum2 = vdupq_n_u64(0),
+				accum3 = vdupq_n_u64(0),
+				accum4 = vdupq_n_u64(0);
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+	uint8x16_t	maskv = vdupq_n_u8(mask);
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum3 = vpadalq_u32(accum3, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum4 = vpadalq_u32(accum4, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * sizeof(uint8x16_t);
+	if (bytes >= bytes_per_iteration)
+	{
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum1 = vpadalq_u32(accum1, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		vec = vandq_u8(vld1q_u8((const uint8 *) buf), maskv);
+		accum2 = vpadalq_u32(accum2, vpaddlq_u16(vpaddlq_u8(vcntq_u8(vec))));
+		buf += sizeof(uint8x16_t);
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += vaddvq_u64(vaddq_u64(accum1, accum2));
+	popcnt += vaddvq_u64(vaddq_u64(accum3, accum4));
+
+	/*
+	 * Process remaining 8-byte blocks.
+	 */
+	for (; bytes >= sizeof(uint64); bytes -= sizeof(uint64))
+	{
+		popcnt += pg_popcount64(*((uint64 *) buf) & mask64);
+		buf += sizeof(uint64);
+	}
+
+	/*
+	 * Process any remaining data byte-by-byte.
+	 */
+	while (bytes--)
+		popcnt += pg_number_of_ones[(unsigned char) *buf++ & mask];
+
+	return popcnt;
+}
+
+#endif							/* POPCNT_AARCH64 */
-- 
2.39.5 (Apple Git-154)

v10-0003-Add-SVE-popcount-support.patch (text/plain; charset=us-ascii)
From 26585ebe89d97bb99b549b8833f9c838cdd3a67c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 26 Mar 2025 22:21:10 -0500
Subject: [PATCH v10 3/3] Add SVE popcount support.

This commit introduces an SVE implementation of pg_popcount{32,64}.
Unlike Neon support, we need an additional configure-time check to
discover whether the compiler supports SVE intrinsics, and we need
a runtime check to find whether the current CPU supports SVE
instructions.  The SVE implementations are much faster for larger
inputs and are comparable to the Neon implementations for smaller
inputs.

Author: "Chiranmoy.Bhattacharya@fujitsu.com" <Chiranmoy.Bhattacharya@fujitsu.com>
Co-authored-by: "Malladi, Rama" <ramamalladi@hotmail.com>
Co-authored-by: "Devanga.Susmitha@fujitsu.com" <Devanga.Susmitha@fujitsu.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com
Discussion: https://postgr.es/m/OSZPR01MB84990A9A02A3515C6E85A65B8B2A2%40OSZPR01MB8499.jpnprd01.prod.outlook.com
---
 config/c-compiler.m4           |  51 ++++++
 configure                      |  71 +++++++++
 configure.ac                   |   9 ++
 meson.build                    |  48 ++++++
 src/include/pg_config.h.in     |   3 +
 src/include/port/pg_bitutils.h |  17 ++
 src/port/pg_popcount_aarch64.c | 281 ++++++++++++++++++++++++++++++++-
 7 files changed, 474 insertions(+), 6 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3712e81e38c..c2769b3bc21 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -708,3 +708,54 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_SVE_POPCNT_INTRINSICS
+# --------------------------
+# Check if the compiler supports the SVE popcount instructions using the
+# svptrue_b64, svdup_u64, svcntb, svld1, svadd_x, svcnt_x, svaddv,
+# svwhilelt_b8, and svand_x intrinsic functions.
+#
+# If the intrinsics are supported, sets pgac_sve_popcnt_intrinsics.
+AC_DEFUN([PGAC_SVE_POPCNT_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_sve_popcnt_intrinsics])])dnl
+AC_CACHE_CHECK([for svcnt_x], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([[#include <arm_sve.h>
+
+	char buf[128];
+
+	#if defined(__has_attribute) && __has_attribute (target)
+	__attribute__((target("arch=armv8-a+sve")))
+	#endif
+	static int popcount_test(void)
+	{
+		svuint64_t	accum1 = svdup_u64(0),
+					accum2 = svdup_u64(0),
+					vec64;
+		svuint8_t	vec8;
+		svbool_t	pred = svptrue_b64();
+		uint64_t	popcnt,
+					mask = 0x5555555555555555;
+		char	   *p = buf;
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		p += svcntb();
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+		p += svcntb();
+
+		popcnt = svaddv(pred, svadd_x(pred, accum1, accum2));
+
+		pred = svwhilelt_b8(0, sizeof(buf));
+		vec8 = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+		return (int) (popcnt + svaddv(pred, svcnt_x(pred, vec8)));
+	}]],
+  [return popcount_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_sve_popcnt_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_SVE_POPCNT_INTRINSICS
diff --git a/configure b/configure
index c6d762dc999..fea70c20ae2 100755
--- a/configure
+++ b/configure
@@ -17517,6 +17517,77 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svcnt_x" >&5
+$as_echo_n "checking for svcnt_x... " >&6; }
+if ${pgac_cv_sve_popcnt_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+	char buf[128];
+
+	#if defined(__has_attribute) && __has_attribute (target)
+	__attribute__((target("arch=armv8-a+sve")))
+	#endif
+	static int popcount_test(void)
+	{
+		svuint64_t	accum1 = svdup_u64(0),
+					accum2 = svdup_u64(0),
+					vec64;
+		svuint8_t	vec8;
+		svbool_t	pred = svptrue_b64();
+		uint64_t	popcnt,
+					mask = 0x5555555555555555;
+		char	   *p = buf;
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+		p += svcntb();
+
+		vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+		p += svcntb();
+
+		popcnt = svaddv(pred, svadd_x(pred, accum1, accum2));
+
+		pred = svwhilelt_b8(0, sizeof(buf));
+		vec8 = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+		return (int) (popcnt + svaddv(pred, svcnt_x(pred, vec8)));
+	}
+int
+main ()
+{
+return popcount_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_sve_popcnt_intrinsics=yes
+else
+  pgac_cv_sve_popcnt_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_sve_popcnt_intrinsics" >&5
+$as_echo "$pgac_cv_sve_popcnt_intrinsics" >&6; }
+if test x"$pgac_cv_sve_popcnt_intrinsics" = x"yes"; then
+  pgac_sve_popcnt_intrinsics=yes
+fi
+
+  if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index ecbc2734829..05266e6d656 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2069,6 +2069,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for SVE popcount intrinsics
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_SVE_POPCNT_INTRINSICS()
+  if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE popcount instructions with a runtime check.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 108e3678071..a41bf90d3c0 100644
--- a/meson.build
+++ b/meson.build
@@ -2297,6 +2297,54 @@ int main(void)
 endif
 
 
+###############################################################
+# Check for the availability of SVE popcount intrinsics.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+char buf[128];
+
+#if defined(__has_attribute) && __has_attribute (target)
+__attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				vec64;
+	svuint8_t	vec8;
+	svbool_t	pred = svptrue_b64();
+	uint64_t	popcnt,
+				mask = 0x5555555555555555;
+	char	   *p = buf;
+
+	vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+	accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64));
+	p += svcntb();
+
+	vec64 = svand_x(pred, svld1(pred, (const uint64_t *) p), mask);
+	accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64));
+	p += svcntb();
+
+	popcnt = svaddv(pred, svadd_x(pred, accum1, accum2));
+
+	pred = svwhilelt_b8(0, sizeof(buf));
+	vec8 = svand_x(pred, svld1(pred, (const uint8_t *) p), 0x55);
+	return (int) (popcnt + svaddv(pred, svcnt_x(pred, vec8)));
+}
+'''
+
+  if cc.links(prog, name: 'SVE popcount', args: test_c_args)
+    cdata.set('USE_SVE_POPCNT_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f2422241133..ac13112a892 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -709,6 +709,9 @@
 /* Define to 1 to use Intel SSE 4.2 CRC instructions with a runtime check. */
 #undef USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE popcount instructions with a runtime check. */
+#undef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
 /* Define to build with systemd support. (--with-systemd) */
 #undef USE_SYSTEMD
 
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a387f77c2c0..c7901bf8ddc 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -324,6 +324,23 @@ extern uint64 pg_popcount_avx512(const char *buf, int bytes);
 extern uint64 pg_popcount_masked_avx512(const char *buf, int bytes, bits8 mask);
 #endif
 
+#elif POPCNT_AARCH64
+/* Use the Neon version of pg_popcount{32,64} without function pointer. */
+extern int	pg_popcount32(uint32 word);
+extern int	pg_popcount64(uint64 word);
+
+/*
+ * We can try to use an SVE-optimized pg_popcount() on some systems.  For that,
+ * we do use a function pointer.
+ */
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*pg_popcount_optimized) (const char *buf, int bytes);
+extern PGDLLIMPORT uint64 (*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask);
+#else
+extern uint64 pg_popcount_optimized(const char *buf, int bytes);
+extern uint64 pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask);
+#endif
+
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
diff --git a/src/port/pg_popcount_aarch64.c b/src/port/pg_popcount_aarch64.c
index cdcfee464e4..9c9e8b9cd23 100644
--- a/src/port/pg_popcount_aarch64.c
+++ b/src/port/pg_popcount_aarch64.c
@@ -18,6 +18,275 @@
 
 #include <arm_neon.h>
 
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+#endif
+
+/*
+ * The Neon versions are built regardless of whether we are building the SVE
+ * versions.
+ */
+static uint64 pg_popcount_neon(const char *buf, int bytes);
+static uint64 pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask);
+
+#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK
+
+/*
+ * These are the SVE implementations of the popcount functions.
+ */
+static uint64 pg_popcount_sve(const char *buf, int bytes);
+static uint64 pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask);
+
+/*
+ * The function pointers are initially set to "choose" functions.  These
+ * functions will first set the pointers to the right implementations (based on
+ * what the current CPU supports) and then will call the pointer to fulfill the
+ * caller's request.
+ */
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+static uint64 pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask);
+uint64		(*pg_popcount_optimized) (const char *buf, int bytes) = pg_popcount_choose;
+uint64		(*pg_popcount_masked_optimized) (const char *buf, int bytes, bits8 mask) = pg_popcount_masked_choose;
+
+static inline bool
+pg_popcount_sve_available(void)
+{
+#ifdef HAVE_ELF_AUX_INFO
+	unsigned long value;
+
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL)
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_popcount_functions(void)
+{
+	if (pg_popcount_sve_available())
+	{
+		pg_popcount_optimized = pg_popcount_sve;
+		pg_popcount_masked_optimized = pg_popcount_masked_sve;
+	}
+	else
+	{
+		pg_popcount_optimized = pg_popcount_neon;
+		pg_popcount_masked_optimized = pg_popcount_masked_neon;
+	}
+}
+
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	choose_popcount_functions();
+	return pg_popcount_optimized(buf, bytes);
+}
+
+static uint64
+pg_popcount_masked_choose(const char *buf, int bytes, bits8 mask)
+{
+	choose_popcount_functions();
+	return pg_popcount_masked_optimized(buf, bytes, mask);
+}
+
+/*
+ * pg_popcount_sve
+ *		Returns number of 1 bits in buf
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+pg_popcount_sve(const char *buf, int bytes)
+{
+	uint32		vec_len = svcntb(),
+				bytes_per_iteration = 4 * vec_len;
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				accum3 = svdup_u64(0),
+				accum4 = svdup_u64(0);
+	svbool_t	pred = svptrue_b64();
+	uint64		popcnt = 0;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * vec_len;
+	if (bytes >= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svld1(pred, (const uint64 *) buf);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+	popcnt += svaddv(pred, svadd_x(pred, accum3, accum4));
+
+	/*
+	 * Process any remaining data.
+	 */
+	for (; bytes > 0; bytes -= vec_len)
+	{
+		svuint8_t	vec;
+
+		pred = svwhilelt_b8(0, bytes);
+		vec = svld1(pred, (const uint8 *) buf);
+		popcnt += svaddv(pred, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	return popcnt;
+}
+
+/*
+ * pg_popcount_masked_sve
+ *		Returns number of 1 bits in buf after applying the mask to each byte
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+pg_popcount_masked_sve(const char *buf, int bytes, bits8 mask)
+{
+	uint32		vec_len = svcntb(),
+				bytes_per_iteration = 4 * vec_len;
+	svuint64_t	accum1 = svdup_u64(0),
+				accum2 = svdup_u64(0),
+				accum3 = svdup_u64(0),
+				accum4 = svdup_u64(0);
+	svbool_t	pred = svptrue_b64();
+	uint64		popcnt = 0,
+				mask64 = ~UINT64CONST(0) / 0xFF * mask;
+
+	/*
+	 * For better instruction-level parallelism, each loop iteration operates
+	 * on a block of four registers.
+	 */
+	for (; bytes >= bytes_per_iteration; bytes -= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	/*
+	 * If enough data remains, do another iteration on a block of two
+	 * registers.
+	 */
+	bytes_per_iteration = 2 * vec_len;
+	if (bytes >= bytes_per_iteration)
+	{
+		svuint64_t	vec;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		vec = svand_x(pred, svld1(pred, (const uint64 *) buf), mask64);
+		accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec));
+		buf += vec_len;
+
+		bytes -= bytes_per_iteration;
+	}
+
+	/*
+	 * Add the accumulators.
+	 */
+	popcnt += svaddv(pred, svadd_x(pred, accum1, accum2));
+	popcnt += svaddv(pred, svadd_x(pred, accum3, accum4));
+
+	/*
+	 * Process any remaining data.
+	 */
+	for (; bytes > 0; bytes -= vec_len)
+	{
+		svuint8_t	vec;
+
+		pred = svwhilelt_b8(0, bytes);
+		vec = svand_x(pred, svld1(pred, (const uint8 *) buf), mask);
+		popcnt += svaddv(pred, svcnt_x(pred, vec));
+		buf += vec_len;
+	}
+
+	return popcnt;
+}
+
+#else							/* USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
+
+/*
+ * When the SVE version isn't available, there's no point in using function
+ * pointers to vary the implementation.  We instead just make these actual
+ * external functions when USE_SVE_POPCNT_WITH_RUNTIME_CHECK is not defined.
+ * The compiler should be able to inline the Neon versions here.
+ */
+uint64
+pg_popcount_optimized(const char *buf, int bytes)
+{
+	return pg_popcount_neon(buf, bytes);
+}
+
+uint64
+pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+{
+	return pg_popcount_masked_neon(buf, bytes, mask);
+}
+
+#endif							/* ! USE_SVE_POPCNT_WITH_RUNTIME_CHECK */
+
 /*
  * pg_popcount32
  *		Return number of 1 bits in word
@@ -44,11 +313,11 @@ pg_popcount64(uint64 word)
 }
 
 /*
- * pg_popcount_optimized
+ * pg_popcount_neon
  *		Returns number of 1 bits in buf
  */
-uint64
-pg_popcount_optimized(const char *buf, int bytes)
+static uint64
+pg_popcount_neon(const char *buf, int bytes)
 {
 	uint8x16_t	vec;
 	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
@@ -124,11 +393,11 @@ pg_popcount_optimized(const char *buf, int bytes)
 }
 
 /*
- * pg_popcount_masked_optimized
+ * pg_popcount_masked_neon
  *		Returns number of 1 bits in buf after applying the mask to each byte
  */
-uint64
-pg_popcount_masked_optimized(const char *buf, int bytes, bits8 mask)
+static uint64
+pg_popcount_masked_neon(const char *buf, int bytes, bits8 mask)
 {
 	uint8x16_t	vec;
 	uint32		bytes_per_iteration = 4 * sizeof(uint8x16_t);
-- 
2.39.5 (Apple Git-154)
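
Editorial sketch of the function-pointer dispatch that the 0003 patch above
describes: the pointer starts at a "choose" function, which probes the CPU
once, repoints the pointer at the selected implementation, and then forwards
the call. This is a minimal portable illustration, not the patch's code. All
names here (popcount_impl, popcount_choose, fast_path_available,
popcount_fast, popcount_portable, choose_calls) are hypothetical, and the
probe is hard-wired to false where the patch checks HWCAP_SVE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the patch's code. */
static uint64_t popcount_choose(const unsigned char *buf, int bytes);

/* The pointer initially targets the "choose" function. */
static uint64_t (*popcount_impl) (const unsigned char *buf, int bytes) = popcount_choose;

/* Counts how often the probe ran, only to make the behavior observable. */
static int	choose_calls = 0;

/* Stand-in for a CPU feature probe (the patch checks HWCAP_SVE). */
static bool
fast_path_available(void)
{
	return false;
}

/* Bit-at-a-time popcount over a buffer. */
static uint64_t
popcount_portable(const unsigned char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	while (bytes-- > 0)
		for (unsigned char b = *buf++; b != 0; b &= (unsigned char) (b - 1))
			popcnt++;
	return popcnt;
}

/* Stand-in for a vectorized version; here it just reuses the portable one. */
static uint64_t
popcount_fast(const unsigned char *buf, int bytes)
{
	return popcount_portable(buf, bytes);
}

/* First call lands here: pick an implementation, then forward the request. */
static uint64_t
popcount_choose(const unsigned char *buf, int bytes)
{
	choose_calls++;
	popcount_impl = fast_path_available() ? popcount_fast : popcount_portable;
	return popcount_impl(buf, bytes);
}
```

Callers always go through popcount_impl. Only the first call pays for the
probe; every later call jumps straight to the selected implementation.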

#33John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#32)
Re: [PATCH] SVE popcount support

On Thu, Mar 27, 2025 at 10:38 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

I also noticed a silly mistake in 0003 that would cause us to potentially
skip part of the tail. That should be fixed now.

I'm not sure whether that meant it could return the wrong answer, or
just make more work for paths further down.
If the former, then our test coverage is not adequate.

Aside from that, I only found one more thing that may be important: I
tried copying the configure/meson checks into godbolt.org, and both
gcc and clang don't like it, so unless there is something weird about
their setup (or my use of it) it's possible some other hosts won't
like it either:

```
<source>:29:10: error: call to 'svwhilelt_b8' is ambiguous
pred = svwhilelt_b8(0, sizeof(buf));
^~~~~~~~~~~~
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15526:10:
note: candidate function
svbool_t svwhilelt_b8(uint64_t, uint64_t);
^
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15534:10:
note: candidate function
svbool_t svwhilelt_b8(int32_t, int32_t);
^

<source>: In function 'autoconf_popcount_test':
<source>:29:24: error: call to 'svwhilelt_b8' is ambiguous; argument 1
has type 'int32_t' but argument 2 has type 'uint64_t'
29 | pred = svwhilelt_b8(0, sizeof(buf));
| ^~~~~~~~~~~~
Compiler returned: 1
```

...Changing it to

pred = svwhilelt_b8((uint64_t)0, sizeof(buf));

clears it up.

--
John Naylor
Amazon Web Services
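
Editorial sketch related to the test-coverage question above: one way to
catch a skipped tail is to compare an optimized routine against a
bit-at-a-time reference for every length and several starting offsets, so
that every block/tail split is exercised. This is a portable illustration
only. popcount_reference, popcount_blocked, and count_mismatches are
hypothetical names, and popcount_blocked merely stands in for a real
Neon or SVE implementation.

```c
#include <stdint.h>
#include <string.h>

/* Reference: count bits one at a time (hypothetical helper). */
static uint64_t
popcount_reference(const unsigned char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	for (int i = 0; i < bytes; i++)
		for (unsigned char b = buf[i]; b != 0; b &= (unsigned char) (b - 1))
			popcnt++;
	return popcnt;
}

/* Stand-in for an optimized routine: 8-byte blocks, then a byte-wise tail. */
static uint64_t
popcount_blocked(const unsigned char *buf, int bytes)
{
	uint64_t	popcnt = 0;

	for (; bytes >= (int) sizeof(uint64_t); bytes -= (int) sizeof(uint64_t))
	{
		uint64_t	word;

		memcpy(&word, buf, sizeof(word));	/* avoids unaligned loads */
		for (; word != 0; word &= word - 1)
			popcnt++;
		buf += sizeof(word);
	}

	/* Tail: anything shorter than one block. */
	while (bytes-- > 0)
		popcnt += popcount_reference(buf++, 1);

	return popcnt;
}

/* Returns the number of (offset, length) pairs where the two disagree. */
static int
count_mismatches(void)
{
	unsigned char data[512];
	int			mismatches = 0;

	/* Deterministic, non-uniform fill so skipped bytes change the count. */
	for (int i = 0; i < (int) sizeof(data); i++)
		data[i] = (unsigned char) (i * 37 + 11);

	for (int off = 0; off < 8; off++)
		for (int len = 0; off + len <= (int) sizeof(data); len++)
			if (popcount_blocked(data + off, len) !=
				popcount_reference(data + off, len))
				mismatches++;

	return mismatches;
}
```

For a vector-length-agnostic implementation, the buffer would presumably need
to be several times larger than the widest vector expected, so that the
four-register loop, the two-register step, and the tail all run.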

#34Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#33)
Re: [PATCH] SVE popcount support

On Thu, Mar 27, 2025 at 03:31:27PM +0700, John Naylor wrote:

On Thu, Mar 27, 2025 at 10:38 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

I also noticed a silly mistake in 0003 that would cause us to potentially
skip part of the tail. That should be fixed now.

I'm not sure whether that meant it could return the wrong answer, or
just make more work for paths further down.
If the former, then our test coverage is not adequate.

This one is my bad. I think the issue is that I'm writing this stuff on a
machine that doesn't have SVE, so obviously my tests are happy as long as
the Neon stuff is okay. We do have some tests in bit.sql that should in
theory find this stuff. I'll be sure to verify all of this on a machine
with SVE...

Aside from that, I only found one more thing that may be important: I
tried copying the configure/meson checks into godbolt.org, and both
gcc and clang don't like it, so unless there is something weird about
their setup (or my use of it) it's possible some other hosts won't
like it either:

```
<source>:29:10: error: call to 'svwhilelt_b8' is ambiguous
pred = svwhilelt_b8(0, sizeof(buf));
^~~~~~~~~~~~
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15526:10:
note: candidate function
svbool_t svwhilelt_b8(uint64_t, uint64_t);
^
/opt/compiler-explorer/clang-16.0.0/lib/clang/16/include/arm_sve.h:15534:10:
note: candidate function
svbool_t svwhilelt_b8(int32_t, int32_t);
^

<source>: In function 'autoconf_popcount_test':
<source>:29:24: error: call to 'svwhilelt_b8' is ambiguous; argument 1
has type 'int32_t' but argument 2 has type 'uint64_t'
29 | pred = svwhilelt_b8(0, sizeof(buf));
| ^~~~~~~~~~~~
Compiler returned: 1
```

...Changing it to

pred = svwhilelt_b8((uint64_t)0, sizeof(buf));

clears it up.

Huh, this makes sense, but for some reason Apple clang is fine with it. In
any case, I see that we can optionally specify the expected types of the
arguments by using svwhilelt_b8_u32() (or _u64, etc.). IMHO that is far
clearer. I'm going to add that to all intrinsics that support it in the
next version of the patch set.

--
nathan
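
Editorial sketch of the explicitly typed intrinsic spelling mentioned above,
applied to a tail loop like the one in the patch. The suffix names the
operand types, so the compiler has no overload to resolve. This is an
assumption-laden illustration, not the next version of the patch:
popcount_tail is a hypothetical helper, the _s32 variant is chosen here
because both operands are int, and the scalar branch exists only so the
sketch builds on hosts without SVE.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper, not from the patch: count set bits in a short tail. */
static uint64_t
popcount_tail(const uint8_t *buf, int bytes)
{
	uint64_t	popcnt = 0;

#if defined(__ARM_FEATURE_SVE)
	int			vec_len = (int) svcntb();

	for (; bytes > 0; bytes -= vec_len)
	{
		/* Both operands are int32_t, and the _s32 suffix says so. */
		svbool_t	pred = svwhilelt_b8_s32(0, bytes);
		svuint8_t	vec = svld1_u8(pred, buf);

		popcnt += svaddv_u8(pred, svcnt_u8_x(pred, vec));
		buf += vec_len;
	}
#else
	/* Portable fallback so the sketch builds on hosts without SVE. */
	for (; bytes > 0; bytes--)
	{
		uint8_t		b = *buf++;

		for (; b != 0; b &= (uint8_t) (b - 1))
			popcnt++;
	}
#endif

	return popcnt;
}
```

The SVE branch is only compiled when the compiler targets SVE (for example
with -march=armv8-a+sve); elsewhere the fallback loop runs instead.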

#35Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#34)
Re: [PATCH] SVE popcount support

Committed.

On Fri, Mar 28, 2025 at 10:25:26AM -0500, Nathan Bossart wrote:

On Thu, Mar 27, 2025 at 03:31:27PM +0700, John Naylor wrote:

On Thu, Mar 27, 2025 at 10:38 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

I also noticed a silly mistake in 0003 that would cause us to potentially
skip part of the tail. That should be fixed now.

I'm not sure whether that meant it could return the wrong answer, or
just make more work for paths further down.
If the former, then our test coverage is not adequate.

This one is my bad. I think the issue is that I'm writing this stuff on a
machine that doesn't have SVE, so obviously my tests are happy as long as
the Neon stuff is okay. We do have some tests in bit.sql that should in
theory find this stuff. I'll be sure to verify all of this on a machine
with SVE...

I verified that the tests failed without this fix on a machine with SVE.

--
nathan