[PATCH] Hex-coding optimizations using SVE on ARM.

Started by Devanga.Susmitha@fujitsu.com, about 1 year ago, 48 messages

#1 Devanga.Susmitha@fujitsu.com
4 attachment(s)

Hello,

This email proposes contributing optimized hex_encode and hex_decode functions for ARM (aarch64) machines. These functions are widely used for encoding and decoding binary data in the bytea data type.

The current method for hex_encode and hex_decode relies on a scalar implementation that processes data byte by byte, with no SIMD-based optimization available. With the introduction of SVE optimizations, we leverage CPU intrinsics to process larger data blocks in parallel, significantly reducing encoding and decoding latency.

We have designed this feature to ensure compatibility and robustness. It includes compile-time and runtime checks for SVE compatibility with both the compiler and hardware. If either check fails, the code falls back to the existing scalar implementation, ensuring fail-safe operation.

For the architecture-specific functions, we use pg_attribute_target("arch=armv8-a+sve") so that only those functions are compiled with SVE enabled, giving precise per-function compiler control without adding extra global CFLAGS.

System Configurations
Machine: AWS EC2 m7g.4xlarge
OS: Ubuntu 22.04
GCC: 11.4

Benchmark and Results
Setup:
We have developed a microbenchmark based on [0] to evaluate the performance of the SVE-enabled hex_encode and hex_decode functions against the default implementation across various input sizes. The microbenchmark patch is attached to this mail.

Query:
time psql -c "select hex_encode_test(1000000, input_size);"
time psql -c "select hex_decode_test(1000000, input_size);"
The query was executed for input sizes ranging from 8 to 262144 bytes.

Results:
Significant query speed-ups have been observed: up to 17 times for hex_encode and up to 4 times for hex_decode.

Additionally, we tested the optimization with the bytea data type on a table of size 1435 MB containing two columns: the first an auto-incrementing ID and the second a bytea column holding binary data. We then ran the query "SELECT data FROM bytea_table" using a script and recorded the time taken by hex_encode using perf. The results are presented below.

Default scalar implementation:
Query exec time: 2.858 sec
hex_encode function time: 1.228 sec

SVE-enabled implementation:
Query exec time: 1.654 sec (approximately 1.7 times improvement)
hex_encode_sve function time: 0.085 sec (approximately 14.44 times improvement)

Improvements using SVE are noticeable starting from an input size of 16 bytes for hex_encode and 32 bytes for hex_decode. Hence, SVE implementations are used only when the input size surpasses these thresholds.

We would like to contribute the above work so that it is available for the community to use. To do so, we are following the procedure described in Submitting a Patch - PostgreSQL wiki <https://wiki.postgresql.org/wiki/Submitting_a_Patch>. Please find attached the patches and performance results.

Please let us know if you have any queries or suggestions.

Thanks & Regards,
Susmitha Devanga.

[0]: /messages/by-id/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=ghLQ@mail.gmail.com

Attachments:

hex_decode_woFlags.PNG (image/png)
hex_encode_woFlags.PNG (image/png)
(=�<@>���@4���o!����E}-��g���-������?��U#��B�
!���0��U�V��K/�������W^��^{�MoR����&�E>V��:u���"�~�<}���y����^������#B����|�N�WD�!��V��f{DE!�B�/���Z�F��u~�R��c�C�i��_M������Dau������O?������BNm�v���3[\\��B�H�����?��)�8A�-������g7�^E�W+{
Q4�������1�;��6��b����j�� �E��W��)Q������V�E�C���"���:.+�2}����n���n��H��C��w"=�Y��D>!���aA�p�G�Ctc��;+�9��%E�-Z6�� ���P�m����$8�������%��y���>y�dbO!�
ME~�����]_����z�'n������� ;��uGC�|,t�����
�"�d�	X`��<|$�����o����XQ���|�e�.s�� ���������}
L��V�B!����s�5�Km�m7&��V�r�Y�:�<���6?Vg[�d���
��6N5��s��V|hz���F�[�p��������X!����7r�c+�:�
����{����kykFZ���'��{��{��r���/�����E��"�&b�g������/^p��"�� Cn��:>�#�q��B�c�]:>���'�����t��y'�1����X��9�����	��#��6p<�	b�L&��d��X^���]��C�S�tz�����=m�c��������o���M���D�:G
g�{��n��N�:Y���u����u��z�ksMi��<�X����8����e�	�O�B����<|������1�[��C���e�~�2'����bsP�����V����qwh�a�(�=z����k]�v�4��Q�F}����d���
���w�g�(<|�s���N|�p8��.F����/����+��HV��p���$e8/�7!�����'Z$q�������w
��q�9�|�d2�L&K�+W����vy�v��o�X?`���l)�W��K���k�m�v]�/��S$��!8�}��qCB�����N5�^f�C���x�]��y%���r!����g\����Y���Z������"{�X���]�^Z�9q��!���V�u�EK5w���4��w��u6l��j"�v?}	�E(�}�C8�����g�9����<��m����L}����4'�{����Z��]s��9�J�_%�B��G�[��[j������-��a��^d�B��H��#�F;x�`7������������}����o�B!����'�T!�����v�������-����Y������������e�~���n��'����8'C]q6��LA�)��|�8����O:�A��"{����O��m�t���1�k�A��>���6p�@�?��7���_~���W���B!DK����1�	Km�]�I��;�]1$��HF"_��I�2C=����9df�����!b���������Y!��)�j{�[����Bq���<�9�=����q����,{v�[>��O��hI���	q;4��G(S�������/�K/������+��g�/`�@�_��/�f��c�>��z���w����
~�$��+�>���`��B�z"Za������c�����sC�
D2����L�0�z�����a�w�����W��G�q,?�3��#F�p�I3Ox������B��D4je)g���f��0�E��z��
w�(�������,{j++=�-���GC�|��{�%�.����B!�F��r����?u_J������X���
DUH��C2;f��+�0���;W�p0�4s�����Yf�-�� �1��a%D���H?V����S��&��8��d�k��E��-R\jEyV��j%����c�,wqK�������������a��?���U�
����y�-u��sb�E�q�E&��Q������X�%�1�`��q�r�����t�z�o:F������K!��,ZVbE��X��o-����>�on�k��ws$�51p���cU���,��"CxYg�H?� ��u�s���Na(�P1(��d-�����	;�,�|q���<a����K�����4&�p���-�sD����[d#wQ+�1�
����=-4��ve��v��;s�b����F�c����g�S����������X��9�+�B��F�����a�����p��Y��V�:������`H���b��|q��_Q�y��8GO9s�f(b!�|�K&��Z���?d�8���l�j�_�T���37/H\��Y���Q|=c�^��Y,}���K�����[|�<?��+b���|2Y�g�6����h�)�BQS������-���-����<o�0��������D�Z���y.�������{�{�����zc�n�~�����^�+�4�
5j�{��qy�7���[77w_m����%S�~�������,k���1�/��S1�/Cs����#k�K��j��de���,���^:f�_�vVz�s���LY����D>|�O>��
���B!��	��y+�X�[������eNz����[�[!�����<x������`�[��"�y��9s��a�u�T3|�������|��[�c��Y��w��^!�9!+9��-��3��e�x�����M��D�v1�g���,}��,{V'+�d�[-�<���	�����o�^9�����P�
!������xK�po|�L�Z��C��6��HWM��'���SQ��zK��k�&[���-���Zj�o��=�
���������!���y=�p�+��e�(B4n�s���-[��7����{�^x��o���?o/���y�B!�������ws�1
N���-�o�EK"�-DM��'����V�[y���S�����u�nS�~7.��E4b��{n!�-s��Nd.�~	���/����m����������n.�����^�*_M!�h��^��2��_h���,���-o�p��'�5E"�B�G4�oE�m��y=,}��,��n�=7���[�����H����j%�?�����9qP��F}���y�M�8QCq�BqS��/kj+K{����fJ�������pbq;H�B�"�(�����,��������%d���T�:nZ�{-}�cn>��]���	+�N��{B4a�3�����nA2��B!����r�]:���q�x�X���������v��w8v��[�v��e��
���c��t�Rb!�����^gy�G8�!��a�pFj����h�l�=K��g��������Z+�v����#	����9��B!nJy�Y����3}��,�oEbq'H��C8=����D����w���CVV�Xaiii���B�
���c������3�����Kc�-i�>���a��^�s�fMx�
6O��/�Y��0q$!���B�X(9�B=�.�������7L�h�:�k�D�:�����;�9
����m��iVTTd7n��3f����[!��3*�-ZZ�����,o�`7�^|�-�����4~���u���������*r���"�O!�����'-k����������{�����[�;E"_��3��]��C���s�,??�9�S�N�H$���Y�|���;����{	!��5enHn������s��������"���4�s�_��o-gNW��]je�)�rYeW���D>!�B44���[5,��F���wY��XY����6H��C���w���;r����=��v��y?~����sB!jG��/��9u��a��p�G��\�+�����P��[���,�$+>���g\�����Q�h9H�B!DC��iRb�.Q|?����[��S�_Em��W���k�.�>}��7�	~����|K�,q�-���(��j�'�[����=�����sR[+��F����?u��e�z�r��rB`��V��e�E�F"�B!����-����g�c�)t"W[��5�N��w��}��9����g�
�>}�����+������t�����V��l|�}V�E�c!
&���3K�����.~������+��!$�	!���(9w�����R[���n�M���-�Ol!��|us�1�F�z�o�k�Bq�Q7WV�d�fY���,������N��i���A`�n��7���y��<4-��@"�B!�����Z�����n����m�4:�n��W���;v��v���SO=e�?�|�=���������Y��vB!��<+�
�-��Y-��O��3'���;�"��������y��8�J/�hY�E�ce~T�U!�O!��
�zy�?�P�_���?����Z����-D]"��IMM�1c���������������'�����'���ZxC!��d[�����n�m��/�{�s�o��S�1���|�^ke�����L3�=!�M��'�B��$ZZlEG7&���������"�$�u�D�:$==�&L�`W�h(!���p[z�3��_ay�Z���-�&�������c����X���Z�����a���jei�4A���D>!�B�'�7�X|]F��o��?��m�b
� w�|u���^��V�Z���#G���C�*���*��� ,^�t�������c����������h�!�hX���V�y�J�rs����`�>�#��B�O��;=�V������Nl��Vz��V��H�B!��'�?�^������E�!����?��b��g�p����D�:�E56n����c��^z��^x�{��7���>��9�rrrl��6`����v�q��|�����[���/�4`;!�h0��,Zq�{�G7Y��>.2��z���E4b��YD#���-}���Prf��WQ{$�	!��>`���y=\~Z������������n!��� Z����6|�p7_$q�}A���9{���?����J!�}��%n����4;v����_�|B�!Z\��:������XM+���\�_H�N�����,�mEG7YE^F<�_��B���BQ���]���,�T�]6�-�Wt��]���z��R�����C�e%g�31���[H��C���l��9v����7wD�5k��{����<D>��"�-\�����n�:��gd��BQO�^:f�g[���>��nH������sa��_�m
w�sCxY^�!�B�����|W�^�]�v9�o���v�����M��QLaB����NH��;w�E�<���6lp�����=�1Zb��}�#(�B��J���J��k��k>������Z��-��a��H�p�G,rp�:��	�|uH8����|`'N�3f�����M�6�f��m{�����ywJff���;����c���s"���E�����D�	!�]�����.������,{ZkK�7'���3XD��w�\����U�/Y������	��B�]����`�k����n��u~�!��B!�<y����;�7c�b�7�7b�.]���������Z�}���m��Iv���Onn�-Z��:u�d={�t�\}��B!Dc���q����2F=m�[�[����p�\���F<����nO����=D} ����D���������z��]i8�|�v��Z_N(Cp�/_n�f�r�{�|D��n��9����p^O�8��F!��XYT���g\��]�����n��q����}n�="�2���*���[�"'�8��>h�"�y�w�e���7_����ow���g�v�K���w��)))nA4MC�C�c��G��Y�_�l����?��H��C,�S��Y�B��J���-}��`���_i�}�-r`U\�����u��e��*�U�����|M��N�:�v����������|�e���>D?�|B�:������-������`�$�������7�����"X�;�����d�������������Ms�E��������1c�8q�7�a��u|�)��8x����f���v��5�="��m��1��_����W"�7~�B�R��0�������|	+:���V���}����zE"_������>��9��������i�&�|B�:�,���^d�����6���c�
���^��,��=�9�E+�:�J��V�u�pD��'DC�E>"��Z�je����	&8��H>����C��'B�;p����H�)S��c1"��,b�����?�[�x�)��P!��
w-tCu�����@$���,s���3�gbKQ�H��%�gw���.�n��an~�:X��+���u��z�ksM8���!%���cp4y���3��w�^[�r�D>!�S�����-o�77��,���qA��,���-���-����5�	+\���}��������GB44�Q���8p���;t���������|wD����D��@�� ��x��������_��Y0�O!�h�D���p��\���f����t�e��
�-:�)���O$��z�Y�b�������!�y����[p$k�T3,�	�
���2d�qX��C.��O�5���)�!�����.ZP&�5SgXZV��2�,��9�:��r������,c�3��`�����-%f������{�.-b��.��'YZ(l��KK���;T��d2�M���,�P�\s��W�;V�F8���>}�	v�/����Kg(S� z��}���c����H{����37(���.�J�B��`���m3-����h����D��r��Gd_�81/��|��������� ;;�9��{}��uQ�m���?���|K�.u���q�GG>++��}'�I222�0z�e2Y��7�jJ�]K
9�����ms���Hq�_9s�.�e�,�s���#_������������H,��9�����26N��{]������yb����#��n�|�K^+�7G�/����H>�-D@�����O"���JD���� ��z����KN�����B�iF�*�|u�\�B��@����Nl�����E�R[����t�]�����a���[��f%�:V�?�j	�z�
��L�ou���C���u�:�*+��s���������s���R^b�5cE+���Ieb4�`��J��#�-oa/��WKkw��u���:&�H���z������z��^<�8��)�D>:�XD��;�\�=Q|�v�p�_b
��{�������M`�.� Q{����"���_��w����?�#�sa~@>e2�L&k����R�X��/�"z����w�������7z��n�z�2�����2�Dc���
+��n�M#��uj��I��%<zw_}�Uw���{�J��W^y�9�$z]q��17��igs�����o��2?_^�����-*���Un��P�_�*��Z��6��1X�m
s������r�e�Wl�_��������x���C=c�K������� ;�s�a�	!�8��M�+..vs#��1u
~���%b��D�;u�����E�c�����Q�fu]�>`0>�t�������(>��6��D2���s�d2�L&k<v���;{���>e��.��C�k�m���;�F��
��]_>�.?h�/]�����6��:)��j�D�Z�#�p$qIL�d�{Z(��H>M���������y�w]�O���������.�.gnW�[5��V��Y-s����q�[�>{F{K��z0.�!�u�7�;����~g��;Yd�r+���U�������Y�M��(�q�a�/_v�Ck�����7�3�Qz��#�1_�q���q���c�]��c�c����P^FG�=�b�R!�h���_��e�,����;�	�b���ne�N'��	�|����D�
!Dc���~���������-����+:��
6N��IoZ��[�����F���i����w�kdMx��V�p��^�����SC��@s��leD�;���F1��1�r
��N�4��}�Y{���]/9���m�6k����j��
���t����O?�DA�b�a���^s�!k*!�����s�#}����. k��Vtd�U�����|��aL�|��%�5�`�,��n��-m������i�>�_n������U��Y��/-c�?-k�N�+>��*
�S!��Q���2Z�6'���\��27�H���f�>�eV��s��=S��~�y\��4B!Dc�����=�c< ��-���,���n������,��EK��i�q"�O!�"�n�,K�G#�V��i�)g���yF{��}����<n��g�e���*n���>���\e���$�B��Is��Bq'|5����o����qS���F�o���h[�W����Z4f$�	!D#Z��_X�g�`�d����e�~�-�����-���,<�1��0���W��g���ai=k��^��]�<|)�����#�O!�AXP���m�����z�r�=����?��!���FY�����ES@"�B4b�1+�p�����a����n���s�k����}��{��O��\��u=o,���{��F���"VZ����>��n��KG�XhZ�G��D>!�B8���<����'}�����.z������C�5��E��h$/��h*H�B�F@E�-K�`%�Z����`�G��j�e���&�a#����rCl�Bn��6�v�����e)������{?hC��z�����e�9�5�Y����kEH!Z"��BA������5���u�i�m��u����'�8+�NKl-�������������X��SqQ������K��A/���q1�����������>�-������o���eMmm�{�Ut#{[���,w�7�F��N��i�;��e"�O!�h� �17��g�=����|��~��%g�i�&�D>!����1�
�J�{e)����+�1������o[����+�"�!����r�*�������'n\�?�����h�����_?�����+��s	!DuH�B!Z&�H�E����/��A^�#��)~
�.�O��\��I#�O!���/K=gE�?��O��!��
����+�������w�b���hH��n~=����gn�m���V�u�	�$"���U��O��'����B��G��/]{$���_����;����.l%����hH�B�ZP��n��Z��:��8�-��1����7����C|a���������?������o���p�<+>���3�ZEN�E�
X�����'�B4/���`��X��e��j�[����g�X����5�m��eb��X��}�5�-+�lC|�=�-����D��,|�J���#�]T]���3��eMda��,����Z�����������X�����eMo��Q�y�[�����B�� K��N��'�B4�ntO��An�>��vst�������h������p�k�p��|�C�F4O$�	!D�hy�U�g���z�J��w�d��.��	i3F>���Ki�������0F���/��G�����eOkk���Z��I9��J�pQz��VQH�B!�����~���'o� +:���b�����cm�[x�c����C�������#� Z��8�h�H�B�L�_-�Q�my.����.+�t���� �0�����\���1�����c����z���m����OY������8�fa����l�����'�B4����W�[5�"{�Zd����[f��>���1���?�t���m�n�I�-�3��-����%7<6��Y��2:�-���pc$�p����c�������,�e�fY���V�y#�0�Ec�%�!D"�O!�hT�g�6�����j��[��+�p�"����M#D��+�.�R���D9u��
6��x�
1b����'~1;|��u�����m���C��B�^���9���7L��9]-����]��Y��0���.}�_���y�?p�~�_�����V�y�*r�VQ��X�J����!�O!�h�D�#V��^�_���z����
����_E��Yb����@��O����D�B"_#�ZNN���?����g�/�i������;���\�����cK�,�����,Ds�����+Vr��^g[fX���n�,���Ez�?����o��h���������p�������L���UV|r��^:fe�KN<��B���D>!���A@+���b�c_�p�?X�#�~l��������|V��=��>fy�GXEN8q������(..v�ys����W���|T��7l����p�B���v���9�v���{!�e��1�W����V���Yg[gY��,{F;'�1���h�D���|v��B��V���_���[T#wq_+�<�����s��<��E+�B4}$�	!�M�h�J��vA,��5�
K����B��O�8�'��i�����w�����e�Y7-
�|M��������"� �h��g���l��eVQ_�'~��mv�����F�r�Xyv���_���knHk����5�ra�|'�j����_���O�[���\t^e4�[#V�U.�����?�P�_�
0��Q���e�x�r>z��7L����Vz��B���D>!��qB{�C�~��f�eOokC���w�����F(<���5�u�]���7M�����D��G��h�H�k�0d���/^�hC�ur��u��'�T�F�wO�<����`8k����1�K��37�]��6Vtl�YyYb�j(/�z�o����^|a���pa�_-���0F���z�2F?k9�X���V|:���N�]`�"�i=!D@"�B�H�(w��Lb(n����k+��<�`��`��N;�9�3��h9s�Y��Vr���"�������������3gl�����=���a��8q"��w���4+�1�2�?a�z�J+�(����2G?���h�h$�-�Qth��<�rfwv��R�1����(<�0�yna�{��0F��s��OdMie�+��*��V|f���������K�����B���D>!���(��12\{���V�v��z��U-��6IY�9+:�����v�k�h���q#����H���V`��"�T�}�������(!�B"_g}��Yv��17G����a!���7�����q�)>���&�e�3;X��YVtp���`������&�ny+�X����=��5GL�������h�q1���>Y���c�o;�V;G��w^������B!*��'�B��uy������ra�X;�-���F-ZV��=�x��u������>o�t�7��$F$��c�
����h+�l����F	c�J�K����r#y�W�oE��V�@b�D�&����m���h�"����.����A�C�S$��\EUb��^�<y��<V���8c�WO���\Vu�UL���K,���������1�?.��h��w�7.���1���#���E���*M�`�Y���X�v��X%��J/�O*���
!����B����O���.P��H�=�-��X;��n���l[g���;72�����Cn����B_�Z���,�d+��q*�
��N(R[�1��7+��������i�]���/���~���S�-��|M�`�
��RSS�w<��k�������c���������
bX�������(�8��!�D�^^��_�";��3o�3����
O%gZ��m9��
>]`��������-gnw7�D���,��w�>��.z�	|���m�J+����b�����^<���,���]�^z�?Z�M�z�r��g���X���Vv��X&��Bq�H�B!�����-s��nJ"�������bm�QnNr�J��g��E��?�������z���gw��X����0��j�r�hH�����g��C�d�6�kS����{��M���9����yb��C"_e���6w���-���d�����+����7n���-���+�J���sVz�3+9��������V�y����Vg�����Z�1�K�7�z0��E"������2?���[�Z{���>*���,o��X�;�_���,k����������=��1�B�)��B��QQ��!X�6����������<��E���>��4D�z��k��gC�m��v�`����i�"����������?����k����SmSh���1����gOLy������������(++s���;�-��������Gt�N�0�9��I���bKKKs����.\��E�p�R�.]������]�r-f���U��]��b���,�.��,�.]�j����+����c{���-vm�Z��}���4�R�N���#,ua}��B��[x����|O��N�cq
��������=j�>[���nbWV�
u�����<N��J1�M���>����:�'^y��0X���n��A��u��?*>*�P�?���O���C����vf�v;{x��=v���<f��<�����K�X���-�~1KNW�L&��KW����K�����1'��}��'�e��.�������6M�.���p8\'��D>!���0�	���)+�r���E��o��������gv�P��,s�K������2�����b�)�=2��h�K�[��yV|�S7g����.�h�E�]���a���������:�>N�g�n�q�����������{������n�7�x��u�f�F��#F��3,==��q�8p�
4�
�eA��@#��Avvv�,��+���2��D�bY�����X���>r,;'��s�b�o9y1�/�Y��D,��(f��)�Y�����S\n�%�[��o\[V��e^=o�?���-|l��n���+-u�|K�0�n�g���k���������veB+�2�M����vu�vm�3vm����������d�{?l7����t�����SK����B��Y��,0��Xbq�����`�`���A��I�c
f<������l����*��5#��jLoc��C�m���c�^�������K�f��aa,�
biG:��v��K�*��L&k������r����X�\(�������0VNDb�mQ��c�Fq�\-�Yi��-���+�323l�����X��������7������Y��_���}����\u����QWT�~�f�|B!��^r��E�v#��������hi�[�<���^��J�v�[�g�p�"��8�r�r�f���?x�B=��T��*�Jm���qQ{�=K���Z7j)7����>�_�m�{��v������U�m{��E������m����:����H���si�������^���f[r`�M�2��_��u�����a�������_�|�g���Sl��_M�V_H�kbD"��}�M�6�	|��w�3g�t�={�&M���?s�L�t��
���/����+lN��]�o�:��rk��;����(v�<���f�Z4��U���������1+;��bA��Vz�+9��J�/��O�YQ�P(�x��j�K�Y^�p����r&�a�c�������A�YF��,A���j�]��V��9�v�������sQy�	A������?����X%�1�)r�9���h7����b�����-o� �[3�U|�fK%X�L������]n���k��,t��sCnA������w\��>V�c�[�J!nEY�����������}��;���;�d�I2��8qCL��cf�,��d13333333�%�bY�d[��L������}�nYp@
����V��g��U��V�kU�s���)�d;{���v������n�vh�m<�������u���kl����|�*�t�P�����u\��ff��uY������
�-i+��N�J�B"_l`�8��w���+����r����#������<g�;���{�{l�}{h�l��y6|��?5��Q�Q7�9��#��i_��S�����`�+�X���p�O�������������#��q�G��>�����I/�q����n�?�>�����w���C�F�w����������p���:w���Gl������6�z�[�}��Z7�,j�'�����Z�Ny��F��*�o���/����k�_�������o��:�Z���m���6}���w�uX��>����P%���C"��t\v�\���j��������*w�����|�_�Q����<����K�p���i3>;}��=��,��v�=�W���+�c�|`�]����.��g�W0S3����m�&�����Y�f8�9�k�s�e�,�e�+3=~�'%j.B��\�@������e�,�e�V�����t,�����,������/����%-�eyn�w���|����v-�;�Ck�����������g����j��j������#L�='�9eL���x�����l��M��97�h1+%��.��lpyO��r��)��L��+W�[�6\tm�y����x�;Yr'����gN�Y�#����\��l��c��#E��pus�����n�l�
{7�;��������>��_��1�'���FX�y��s�[N�`
�5���f��X�r��G1��9���&����]���vO�����V)��a�D�5m��n9m������H��
�:X�\��i��^�B�,��x�{�������H��7�����^�+������3�G��E�V�o�e�,�2^zA���U�h��~�"?&c��\n����V�������*�|�U����nl��j����,e'�4�S��k3�"��P��>������	��0��x��;����y����c]���%c��Vo[�Q�m���6��|M��'�<g��v��'�*>���'�o�y�V��|�An<�tbk+����#������^��I���O��f?+�;�����K��~Q���R�_���������rw����c���W{���pnM����X��=���Y��m]����!�Od:���cy��f��y���j�v��w���>�7�>m#W�3ny��Evv�����Y���/#��U�h��v�}n/�!������Ki�S[�ODk��"��g�v��v�uX�[}`G;�c}���!5����v�
������e6.��U�\���6]�����U�o���;�;*��rt��F.�;b_�}6��m��u�w������c��(;s�L�Q���������.�v�x���^�[��[��O�Yd����q+&��l����:��f�'��zc�Z��u���JV�W	��K~{��G�r�w��F/��j?mwU}�n/�������b���������O����Z���{�w���I{��k�.���%�}�B"_��<<��s|4�����w���7�Y^�GT�A7fBD;��-;5��}{xW��s��Y���n��}��[;��~6�N����m��h'���3��Ku��x�F������wc���1�K1�|�P���G:������`�!���.��lW��M
�@��v���>�B8q�:��n�4~�~]�O��o��u����G��w�}h�&������I]�����g����=o��/�h|K�������-�K��y�.i�V�O?�)��������?�/J��}���m�{�:�{1���*�d��We����x�*
�a+�}����"�Od:*u+bu|��-kb�� /<�=����?��*�go�~�>���}\�I�W�	+R�1+U�+����������f��6n����=�������R�^U��6�}>����l�;n����;�]'���v�a���#��8U�a;[�	;�:�s�����r��nE�b��viX]�v\K�<��]����,mWVN5[����Uf{6��m�P;y()���C����4s�nC4 *��y�K�����_>��:�������ro��U�T���ja�V4�����-��3^DY��@�Z�O���*N(c]V���+�[��%�������#q�����E;~�����vp�}���d,���f}9�&����aB�9}��k{���3��Uv�r����P����KA{�}.{�U6�G����/���=o��{�������3�P����ZO��5�p�cvo�������������T�A��kO�P�>?�y{�{��rw�o����X9�*�q����9M?/�{�Oo����c���S��m��~�gW�,�������_g�V��~��=���f$_�/���]>��kg%�V�B"_����)��/��������g��t�l7��|�����!}����K���:v���v�1���vy�_�.�i�'���m>�����-l�'w�S�r��q-����G��7�aI-�����mg���lO�ij�T��]
�9V(�����H:"���#�n_��������f�bJ�M���G���e}�D��)VR������ck�������}���[!N^vc��n���&^�q�����	�t��MK����v�����������%�5��2���v�\�W���Z>�������m�0��}�wc��7�Y������;������l���m5�������w[����C�S&v�O���J�)n�y�O&q����~'yD��o�cz�������������O�`�����9]~���H�����t[�}�_��J�S']=�[���������l�t�^�"�Od:^o��������vS����;�hn/�{�~��_����=�����
*�����}�{��s������9{����q7}��3��+�{��_�e��5go����=g���5����r���|�
������n�OX��/Z�fo[��Z#7�n����T�:�N�����Zg2����l�MX1����c�7|n���s���������`_�j�\G��5"{N��g��1��/����`�-K�>����Nhm�\'Z�o��n������b��/9��������;���|�����V�rt�������b�j���'Vlda���^������^��u�����|D�a� �a�C��k�\�\u�s2~��f�����6v�D��(��p�o;��n��c�dB+�;��U�KW�[���h��V�W)c	k�������,{�<�n����69������GD;"����=\�Y�<�q�����+������YL� ��$!����
+�W^XK�B�������s���~��4I�����R������_g������u��'�d�5y����v~�)���V��kI����������c�Y������y���oY�e�S��C�~���V�w	[�JD"_������
�X���_?������I����gl��nV�����n�����v`�k{�����e������m>�7\�v-l�f���)�x�!�����;��U�^��l���h����a�Z�������=;�Y��
,w��}9���&:�[�x��I����E�9���w��CW�\ylu��(vn]e�������/��#������:��.dUz��Z~`�]h1��}�����to�n�����].{��A��ym���v�IyQ�q�gm�2���{�ZO�kEo�7��a�j?c��3�=�sD���Sm��1��m�o�J�,o�����}���c����]wzN��=���)������#��9�2{��B�^O�W��3^�C�c�"���������9J���X��l���������u_�j��xn���d�JvrB;=��_�ua����S��l�K~��}I����	a����=K�:P����j���&i���>���
�o��T��s�Y��-[������q��:?5[��~?�q;�z� �f����h
\ ��Vn[m\}����D?����6���j�jh��x������������[e����c�5}������[�hF�~+~c�9�m������`�]���n�������o�����z�������/��c�������������_������{���^�C�;O@M@"��t��<I��)���������sm^�/�[7�t��������������$s�-���������/�~���A�����;+�o��l�R���oe�d?-w��W���7����=bw�|������_�9{����X�W���o��-��WZg�7]��n�\~@��k�������[���TR�li�2�o���v��X�Y=}C��5��7�2A:�1�'�����L{^��r��Y?�����5���e[V�����]_�H D�-���nG��~;���5��c�1f\�p������P�����^������5���GVmD]/f�8E�������G���){��K�_���`0�T���}�N��������42D�L�3+�*�O��{�m?�3eo0�����5������4��wm�v��e��u��_�N�Y������/w���-+�X�d�r�����s����O]'���E>��b��6����u���f�ra	��53����'��n�VM��������_1�wD8�]��Z6�S�94�7y�{a�����:��]����Z���D��bg�E��U���O�=9�w�`���b�u�f�V��S;[�q���[exm��/��z���n��s���j���y�� t�� |����|d�����G�!�}'�}g�i�	k��S�c�}����'m ��;!��>b��:f@S�/v{�{}������r���=���g�����_��F���Q���u�m9;��\]���Qq���k���m�����Y�]�����g�@��
�K��:�C�����3F;����=�<���3L9�����J�.�-�����|��J�����=[�l��{��X�;j�([�vmB�����&��a�>oy{e�"#��n�G������OY����A�\�V�=��[�d}����g6����6c�;���8r���s���^�Oz�oE��s�p�R�W�Z�������f���[��d�7{�
���A����
����:�	���w���>��w�M9P����ov:_�Q�2��Q�|p��z9����X}G��6|��k?v~s�n�Za��������V��5 ��w>b�����y���C+2�����T�Llm��������znL���k���"�J���{V��K���Kg��!g�:��#B�[�_�E���%����������-��Q�_��{~�;���s��<�������o�=�`���;���d������n���k����c���,[�w�C��lS�lv�]��h��x�n�G�D;�?�3~z��'[�;mo����W��G������I[�Y����f%�=gu�f�G�"�����C��>��1�������k���U
{���?��w��
���?�w��_����q��d����/C`���������<5��a�F�c$�E���6��_6��f���W1Va\���q+@�����u�t&��m�r�<~���I~��y���1G��U,[��`�R�^���o���E�g��'�Yrw�G�?y��\f),�f�?�dI�����$������=c
���"��C#�j��;�awT��=��e��c+���#�[�����1�lw�����q�vf�%��LG��|��
k�y��=�hhx���m��A)�����?o�K����Y��������q�Y�����C>����%h�u�`�`�>c��24�����nO1�)�������hI!����a�K�B(�w�4�`D�9�wHQ:����{����
^�"BJ�L�O|P�9�W������!F�����N�b]f�r��������gO.����/���������BD�����U��m�����������K6-�����B��0�����^t.3���"��/������ ��l��K��'�c�3�2+�����s�����E��n@�a�m�v���-&��f��X��-��P\3��P�u�G�����+
�����
��:�*~��D��V�o9+�\���=����<���%�v���������?�QT�F���{�����/���fo���g���T�W���/�c,�L���&��|���{�'�A_�\�������Iu��$�7�z�o���k�L�����dt����U�n�t����?uu#A�XQi�;�������l����,&��oe��?V���S����l��������/�%=��I�BR;A{���,�M��oghoh{x�q�n+�o�h�h�X>K�u�v��E[G���k�����,�e�����F��:J}}��a��O��"���L ����B��]�s�����!K���o]m�v}e��o��G��7P�e��#���P��v�����D�
�}��"�o����="��G8t��M�>=a"�
�/�k������~���i\[S~b){����eh��#�G;����E�:��o�HC��,[��,_�m����G��=�6Y��������^_��9�G:?o���Y��v�j��D������f��������x���;2�o�j��i7x������}��Xk��b�z�������������N���k��M7�����+?����a���}����y�����G��#G�X��o��]��n���}x����7�����e��)���a���^���u�h�K����q�N���X��9���_�!����mZf�p��G����G�Rg���>h��c��[Q�m�n�|l6}�f�g�|����cnK������l������X�eIo�G�i��?�tx��V~�N;�������������Q�����viHm�4��]�_�.�)gz���������|�O�\���\�v���v��[v��kv��Kv��sv���v��c6��_-g��[��m��6��(���U��7�W��]*s��,w��,���p������J���t���L��L�7�l��v�};���]�U�.�b��k��if�'�7s��-a�rJ��F������^�����j�U+,�$r��.�X��l����&��M�������*L2��D��k[�4���L�3q�t�
?I������U�Mx�*�����B�����=} ���rt�c
�����G��ECcn�ms�]�a���Y��2�|���y��;n�������/]�1M�8����j��
A#����A$y{����~�{�q>c����'o���32.aL����W�����CR|^~2��?��=P�Q���~���#�DsxC�w�p���"����s�_�
��2n'����^o��G�1�bR9[�VtdA+>�����G36s�M ����l	�H�An%�D�c�������r������~=>
]�����z��@���t�9�[���h�g}9�&��fc�M��bU�9�}d�%��"tUZ�J��cQ7���9��w��������/{��O����=~��>���~mK����t�}���56�����I�^���H���A�;��f��{�!��H��;�X�O��]5��p����-�!��J�-��yzMK�|n��.e���O[��Ml��~����>����vo����)����3���N�����9��w�M[X������������l'�@���������;�:�������s�>r��}�������^�����Dz���+&�}��.Ex'R��8`"d���|��i�&4hP�r������q�&��v�Gk=`C7v��(Tl��������Q��O��G�����N��3Val1��s���������Y�6���`Z����o[e�g�������1e@����TwzE{������8L���r7����Vp��^�#(�c�����}���k�����q��S:�#5�A�t�����&��O.c��z�o���_�u����e+>,���p��e����f�����>���~���]��Av�{�����_����M���W=g�WL�K���E�m���6wN�3������fOlm��5�Y��L�/�T�f��`�{��������lj��6�MN���=���
���yR��lR�'mv�G�}�k�����R���9bKO����\�6���l\����M*�k��~�W�/��gC�m������Y^�w6��ol|������d��������vG���J�?&Y�;�M(�'����	���m�������g���N�~�{�!��X����2w���w�2�Y�{m���)
_���s��nEl��*6kLS�=���]8���b��l��uI��l����E�n-��������������
�`R�0�z|���o��uu�8�������L������Ll�'(��bB���6�
z����
��z��[g����#�X��~�L������������k/��ON������'������'1��7����L��^����Ypi� 6��+�+w�O�����S���?�������31��e�?��E8��#I|�����v�'�����?���n����S�=}m����Eg-�����3�c�����%��������H>����[����={��>����,(������j����_�����w>++�6��C�x���_o���wYI�U��'2W.k���L�w��j�KT���7y�������3v������^<����;�M4	��u�i.=y�A���3D�o�K_i�X��2Mf�X6;:YT��� �>��|3;�x|K�3��_���S�=U�9 y\GH����
��!��
�����!��/"���5�!����%!z("��G%7��~��aMDbJD�sB�"�d��x�wtt8tz��Eh*>���������`�/���%m����������-)R2���aI��y�I��������B�}g��a���<���2H����4��2���L�\��}>��Ik����p)��������:9�&�����Q�7�P�������9��j<������[�J�D��b!�&�}����
.XNP�
8�&�aa�������l�_2At+�K6/�5����o�ogv�6����A,��]�}��������=��;��:��s����=�t�������D��aO��#G�,���}���u���T~��]�t����bw�z %B>�n�Z����������^����O
���?"�.\��%�L��e��:����_~U�N������>�t���[��r��a����~W6����~R��Vv\	�����!V{Fe{��3.���w��_���=X�1/(E�?u���g���~[�.�]���q�U���|����W}j{��Sq��s�Y�>?�I��O����}���������~��'��	*�?�*b|g�����v������}����Uq>������?����nO�x�~S�_����m.-L�!�s������-=-��|vg���wUI��Q[����J�i�w���z�#
����y�<pis�����&�)_>�Xq�F�Q0���,�{�Z+�MD�Cx+>���t��;���pu���Iv��su�Nw�`riH�����+��������v�K�=.��(�������tA{����{��=.v���{��U�T��m��"�����jL�hw����.�i}7VF����>oM�����t����������j��-_�bV�_%�>�������6���6n�$���|[�y�}�{��9��N���������\�I}�YZ}iz7|�h�o�|"S�~%�6|n��`x��16a�x[����8=����1��c����3g`� �{!��o��<d?�5~��5~�7�
�R�����X�:�����U����Q�����#m@�>
]g��3"���Y�W����/��0���f����rT��
��Z,;e�)a�!t�Y��j>nw��`�|��X���,�
N��x�Of��n�������0�P��G���J��9,��$�PY�����Y������
,�fo2B���[�g���:�B<aB��m�����(Kf��&���������F�����w��2��>���Di��Z��t+ly��|=���(����)��c���+X����E��V\�*���9��65F6py���j���&~EB�Oh�����6^��L�L�hm�u�3�<�g����������{������� �7K��{��e�B%V�w:g��~��o��e���*��g������Dp�bg���US�2�����{]��sm^���K�l�w"���� N�s���e��o���-����;�7x�& �����(������|H��D���{���+���U�q�������K�<x�6n��E�k�������n�B��[��-|�J�,-��#��,�����6n�{s���e�ek�������A%���'��}����76}����;�sDi;���qsF��5�vK��5�O:X*Yvd!{����e������qo[��{��d�k��}�U�����[�Qmm��=i~?��C��gk?��;�� 
�G�!y}�l��=n��s�N��,��o��p�lQ�
���=j�'t��q�l��6}�T{������6>y�� ������J���3�i��4��qo�:v>�[���N)B#��������k1�����7��Gk[7o�%���S����sk���}��M:���7��=�|�U_��-�����9�5�w�!���N�xq�������v��[v����;.>m��M�m�W���F�[�i��r�0�@�c���
^���?����J���_�-{�[6���7��0�a�����qi0�����[�	����/Z9W�wGo�eU'���<gm�u�����y��9ba����>�:C��n������|��6����m3y�v�&��!�1y��
�D&$:�;+����HQ=D#��(��"b"����"i��>��h���~>6�g��s�y
1�=�u����(�gTy�%��]W���?�����
�kx��=�X�"tMJ�ZNio�����/�����n��"�A�b�=�>.
l�^`
���
���0�F6��%��G�bVN�����K-����#+����6�����1��@>��r%D�[W�W��v����/�@��~�7~[6q������z ��L|���>�!w�V'�2�M�I�V���s��d?��-#��2K���������z�w�r�[��5�����������~����
�j��!�viq+��=���'�������j�*�����;ws��h$�7�X��9��&O{���u���Q���n�����q���{K1G�~����X��y�����{��M���OZ�����=��<G��wo���>�'<a�7�Q3������������!#zW���+i�#Z��3��
_�ze��T���p�4{���z���g�#��Gc@���O4z�o�A��t��wPn_.�P^i�#Z��������}��C_���B��^R?7lLZ&����58p`�5�Z����5����jd�l��E��zOZ������������l�~��e{���~��lWz�ik��z�J�x�V����K�����������iO7����Q)Eh�:��=��������]�����9�5X�u�����_�[#���w��;���d�J!��Gk����~��Z�n�'�J���;���aO6}�O�;}<.i�����S��o?w��?�@~�_.�M�i}?Vv��y?���9����_�����{>�n,�K
���Xrv��i?V��<{�.^��r}���KX�}_V3�C�|?b���5�����oX��e�����o`E���Z���(g�MO�"=!�/6�����;[�R��u��>O����l�7��r�>b��:����O�hc�u�	x3��3'l��V�����H���u�����j���6mY��M�'l��m�Z��1��R�1��]:>l��
���$�M2�\t�g�"�V0��2�����Uh���X?�
j�iM�nt�\��G>���!W��]����ox�����|�{�]�^��\}D�[�a���q���]{��]��k����\r�A�g����v�d�����gm���V����xrx���V����l�W�?=r����������o�`�G-{���;���_���rC����_�y�t������~�����l�������6)��{�����ZN���x����eiw����.�����L�X�<`�/	�2`�)�u��'��}����&�����~�m�n��RF��B�X$��"x�~��+��{"A@��q��_<���dC?jK�����%��e�}1��,b}]&~>�6�A�%p��[V��%c���A6��c���v�������S����sl���a�g#��
������"���������o-v�A�$�G��W���?�q���q�"��m�MY:���:0h/j�j�oI�k�a{��������|��l����3�x�l����wg�4�w�0��a6w��B$Q\�t������/N�r��v�z��y� rl���l������3�s�V'���d�������"��B"�".f�r`����N^BB'��S!��G"_V "[����#����@b�V(����|B!�Y�|B!���|B!�Y�|B!���|���9r��m���v��}�o|B!D�A"�B!D�F"_&d���6`��5j��/��"�/B!���H�B!���H��d����?��������L�b'N��7&!�B���D>!�B���D�L��M�l��������6a���!�"�"�O!�"s#�/����I�&������8�c���B�����gB!��zH����������+�S!��jH��d�X���M��_������m��Q���B�5���1��k���5�O��s��
v�����&������g~m��r����'O&�XxVN�8qK'���[�2=��V��������B\�|��9s���a�RD>@����?���B!��2�Q�~��*
"����k��
�O��/�������DM\as����h�"��t����S����>�e��Is��={����+m��U�����=��:�f���$�0lI4o�<[�p��������Q�{Fp�UcK�C�a"��V����G���;|�P��<���^!��D�L���� �M�>]"�B������@\����wqc��q^h���wdB��%K�E��� � 4$2���!l�"���[�6m��%�P�y]�J[�~}�&���n���
8�z��e;vL��s���f������{�&� �v�����k#F���-[�rHTh+��[g5k��?=��zl�D�+[��
4���D����C+W����]���]����p��a?���W�\�������D��C9M|�"z�.�#?������D�?i��H9��[q}�����7�D�L��e�|��e��;���#m���	�yB!D�A"_�q��W(����[��u�@�b��~7�KX:�#���cP��h&��D��$�#j�6����OB}|�������aC�v"y�7"'u���)"
��ODoqM���fp���h�D�-� j�%�#����?��={&4�a�y��^`b��D��D��\���~�4�=bg� ��$Eh�����1��9s����3f�v1Qi��	�i����,�e��qc+V��u����E"9r�����/�o��D�]�5����M�4��B"�l���:w���I�3�28�|av�����������s�T���d2�L&K�1����R"_��I\����o������-^�8n�'���/|J���U+]��o���i�T`���H&��-D�X�=��E��[d@��z,W��A~�z�|teOX�K9#,��)B`�K��7<����@���DFr��6"[q�������i":�<�Vy����������j���@�xN4��#:������D"���5m��j����D
\L����5���[�O&@�����Lx���C��Lv`�W����{���D����z@��^�����0i���>f]��Y�$Th�L&��d�3�i�>��h���qa�������>��%�.VP�"��S�}*������Dx����z�� ,!���,1��=�W<��r�'O��K���?BL�FDD�X����5�)g�a�{g�O�p�-@��|�"_�������#��M"KXc	�9����{(��#pSO�Q�s�=����@�%��q����\��.�p���B�/�a�:u�2'���G:�7A��#_v��G}�N���ji����5<�k����[!O��h6��DE��\2�������K���3�y�� �aoh�� ��\�l�@���?�7�P���E��	}1m �mb���_/�28 t�4�4����!�Bd]$�el�U�B����T#l0A�}�v�(����o�#����"�CtS��B"&"��X�����*zDO 6a���XD�!p��?��V@#�3���N�[���y�z�-"|�sS�y����BZ����e�c`c)�q/�+��\�|g�����
a%DOQ���r��y�����������"�����deF�$�u���7���)Y�������<O��2~g\G�S6���C�bUD_�x��XDd�!��O}� �,m�p����2�9����w���9"_�\�r������NG���s���3O�7���Z���	� �B>��X��?}�:��nA�$?�!62�@[@�D`�~#�~��DFsR�x6�l�� ����~�����D�L� _�|)���B������!��hX�?���(�X���5���h�0�g��`A�n[�"a��@,�� 0���"#5V]�~�X���q�A;�L\�(%�6���9��� BF�����k����T��F
�i������"SX����=�Bd�lx=AL�� �E�Q�p-��N�:i�G�{�!!���EyEu�:��rm~"n#0�.��h�G� ��t7�U��z���g��Q�<����@�S,��6�q�v�R�5�C��kMJ;@��"���B���#J��F�'��|�$m��u]xbU)O��v�����/b*"u���t��~�f,EV����+y@L�M�C�#oh�{�F�I�����A<$o��gD=&_(o�
���A6�"+�3u����*��!=AL�b�!Q���������v�Q�F�O��|���JG��a���/�B���D��	6g�+�
�".1�F�a0P����������X���O�N���a��/�AXc�� �'V�+�A��2�B��2�3�=��`a�4��D�V��k����qA��A=yC:)�X��a`����C>3�E�C�C@�3����;�!m��X.
�����A<�Y����Q��hAX@���&�A�#
D� ����@���: ��97��8���R��h�8������<���Y<a�8B���I�P�������=Qk��u���,R��H�B�+�l,�F���E�W�R%�D��7\�:@}��|���1w,u�-�>�D�H�6��02�i�Fc%rr/�^4����A��9��@�QWc�;�5��3��:Z�2B�@���l����0	���A�� `���BL^�G*�K�3$�	!�Bd$��k�����*4���6���X�D	?
"`,`��`��
�'��Q�������|����T�T)��X.�bp�����'�y"�H}��+�X ��8�`�������"�q��`�A'�_�B/�D+���
������,��`��.�,��*��(�h�*r���>D#��R8"��B>QO9��-K%���D4������I����������C�ga��z��C]`�$��v��^���]��kc��k�AXb<M�P�zu/�Q9��?�Qm��)��N�#S�a,�g<������O}�L��t�W��S��+D z1Q����B?yO�$oxN�b!0q}�k��u/L�9����Kz�;D'��X	�DS��.�eO:�>�#�1�rG��.�G�����h����S(��~i�w-��4�J��&v�*&����yG�&���LNq}D�h�?H�B!��H��Cz���R�B��s>6u(�Q�5��M�7�(@�A_,`��������*�� ��"Z�GD3��)pO�N�b2Y����#O0�
�Rb�J!���H�N��?�A��f���;�:�HGt�{��{B�@���Dr��O��P�%"/��H"��.��H���D��i��P���p~�!�a��	�����{A�:Ic�j��%
\����!$q-�����6�/O���I}�6_��H��HY����	�6�'+�B0�v�'���B�#�����HD��}��c)?���G� /��%_��D�-Z��AD��\��PN�]�|�9�
y"d��s��6�v!�(ND�B�EeM~���?�2bn��<����D���XN������N������k��P�xVx~b�'���B!�8�����@��"��DYG�7�*,�����H
"b
�ZJ���A?���X���+K���� ��;"��@��h�c��9�B�`O4
�-��9�F����e��=00���V�!?>���!�'|F�v@M����S'/t"XDF�r~�����f��Y
�u��i�P��A���f�8���b����� .1���X�h4�o��p��Q�Cu��'r�0�m�D9�������bb�@��rC�A����D>R�A$Bh����u!�g��4�`$�Z\�v��S'�d#���YD������<�G�B'�]w)�F�-E���6��S_���P_"�N,����
�)���"��	e�����R�!�7�� <�GD��fG�9}T,�?H�B!��H�7k�j���c �c tPW����a���jb0������o�Ub�<�/�
6�K�"����#�0`7B�KOIDAT#�
 R��DE��"�@�$��v#����!����|��>�g�H��"����S����bi b_�=���>"/�K,���:�}�o�D��=s].�����43VQ��Z8�f�������������?�K�c�C@
[binary attachment (PNG image) omitted: perf results chart]
Attachment: v1-0001-test-module-for-hex-coding 1.patch (application/octet-stream)
From 6a535299d62b36683b53f66d6b70de52cee4a0a7 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Mon, 16 Dec 2024 16:33:23 +0530
Subject: [PATCH v1] test module for hex coding

---
 src/test/modules/test_hex_parsing/Makefile    | 21 +++++
 .../test_hex_parsing--1.0.sql                 |  2 +
 .../test_hex_parsing/test_hex_parsing.c       | 86 +++++++++++++++++++
 .../test_hex_parsing/test_hex_parsing.control |  4 +
 4 files changed, 113 insertions(+)
 create mode 100644 src/test/modules/test_hex_parsing/Makefile
 create mode 100644 src/test/modules/test_hex_parsing/test_hex_parsing--1.0.sql
 create mode 100644 src/test/modules/test_hex_parsing/test_hex_parsing.c
 create mode 100644 src/test/modules/test_hex_parsing/test_hex_parsing.control

diff --git a/src/test/modules/test_hex_parsing/Makefile b/src/test/modules/test_hex_parsing/Makefile
new file mode 100644
index 0000000000..21a8ee416e
--- /dev/null
+++ b/src/test/modules/test_hex_parsing/Makefile
@@ -0,0 +1,21 @@
+MODULE_big = test_hex_parsing
+OBJS = test_hex_parsing.o
+PGFILEDESC = "test"
+EXTENSION = test_hex_parsing
+DATA = test_hex_parsing--1.0.sql
+
+first: all
+
+# needed?
+test_hex_parsing.o: test_hex_parsing.c
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_hex_parsing
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_hex_parsing/test_hex_parsing--1.0.sql b/src/test/modules/test_hex_parsing/test_hex_parsing--1.0.sql
new file mode 100644
index 0000000000..5676e895c2
--- /dev/null
+++ b/src/test/modules/test_hex_parsing/test_hex_parsing--1.0.sql
@@ -0,0 +1,2 @@
+CREATE FUNCTION hex_encode_test(count int, num int) RETURNS text AS 'test_hex_parsing' LANGUAGE C;
+CREATE FUNCTION hex_decode_test(count int, num int) RETURNS text AS 'test_hex_parsing' LANGUAGE C;
\ No newline at end of file
diff --git a/src/test/modules/test_hex_parsing/test_hex_parsing.c b/src/test/modules/test_hex_parsing/test_hex_parsing.c
new file mode 100644
index 0000000000..dea19f365e
--- /dev/null
+++ b/src/test/modules/test_hex_parsing/test_hex_parsing.c
@@ -0,0 +1,86 @@
+/* select hex_encode_test(1000000, 1024); */
+/* select hex_decode_test(1000000, 1024); */
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "utils/builtins.h"
+
+PG_MODULE_MAGIC;
+
+#ifndef HEX_ENCODE
+#define HEX_ENCODE(src, len, dst) hex_encode(src, len, dst)
+#endif
+
+#ifndef HEX_DECODE
+#define HEX_DECODE(src, len, dst) hex_decode(src, len, dst)
+#endif
+
+#define TRUNCATE 1	/* if true output min(len, 8) characters */
+
+/*
+ * hex_encode_test(count int, len int) returns text
+ * count: the number of iterations
+ * len: the number of bytes to encode
+ */
+PG_FUNCTION_INFO_V1(hex_encode_test);
+Datum
+hex_encode_test(PG_FUNCTION_ARGS)
+{
+	/* DEADC0DEBAADF00DC001C0FFEE in bytes */
+	uint8		bytes[]	= {222, 173, 192, 222, 186, 173, 240, 13, 192, 1, 192, 255, 238};
+	int			count 	= PG_GETARG_INT32(0);
+	int			len		= PG_GETARG_INT32(1);
+	char	   *src		= palloc(len);
+	char	   *dst 	= palloc(len * 2 + 1);
+	uint64		encoded_len;
+
+	dst[len * 2] = '\0';
+
+	for (int i = 0; i < len; i++)
+		src[i] = bytes[i % 13];
+
+	while (count--)
+		encoded_len = HEX_ENCODE(src, len, dst);
+
+	if (TRUNCATE)
+		dst[Min(len, 8)] = '\0';
+
+	PG_RETURN_TEXT_P(cstring_to_text(dst));
+}
+
+/*
+ * hex_decode_test(count int, len int) returns text
+ * count: the number of iterations
+ * len: the number of hex digits to decode, len should be even
+ */
+PG_FUNCTION_INFO_V1(hex_decode_test);
+Datum
+hex_decode_test(PG_FUNCTION_ARGS)
+{
+	char	   *hex_chr	= "DEADC0DEBAADF00DC001C0FFEE";
+	int			count 	= PG_GETARG_INT32(0);
+	int			len		= PG_GETARG_INT32(1);
+	int 		encode_len;
+	char	   *encoded;
+	char	   *src		= palloc(len);
+	char	   *dst		= palloc(len / 2);
+	uint64		decoded_len;
+
+	for (int i = 0; i < len; i++)
+		src[i] = hex_chr[i % 26];
+
+	while (count--)
+		decoded_len = HEX_DECODE(src, len, dst);
+
+	/* convert back to hex for printing */
+	if (TRUNCATE)
+		encode_len = Min(len / 2, 4);
+	else
+		encode_len = len / 2;
+
+	encoded = palloc(encode_len * 2 + 1);
+	encoded[encode_len * 2] = '\0';
+
+	HEX_ENCODE(dst, encode_len, encoded);
+	PG_RETURN_TEXT_P(cstring_to_text(encoded));
+}
diff --git a/src/test/modules/test_hex_parsing/test_hex_parsing.control b/src/test/modules/test_hex_parsing/test_hex_parsing.control
new file mode 100644
index 0000000000..c48ee81998
--- /dev/null
+++ b/src/test/modules/test_hex_parsing/test_hex_parsing.control
@@ -0,0 +1,4 @@
+comment = 'test'
+default_version = '1.0'
+module_pathname = '$libdir/test_hex_parsing'
+relocatable = true
-- 
2.34.1

Attachment: v1-0001-SVE-support-for-hex-encode-and-hex-decode.patch (application/octet-stream)
From 45c9e42b317eb8d53d37536253c673db5362d775 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Thu, 9 Jan 2025 12:22:00 +0530
Subject: [PATCH v1] SVE support for hex encode and hex decode

---
 config/c-compiler.m4           |  53 ++++++++
 configure                      |  63 ++++++++++
 configure.ac                   |   9 ++
 meson.build                    |  47 +++++++
 src/backend/utils/adt/encode.c | 222 +++++++++++++++++++++++++++++++++
 src/include/pg_config.h.in     |   3 +
 6 files changed, 397 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..bb22ceed17 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,56 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_HEX_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE intrinsic required for hex coding:
+# svld1, svtbl, svsel, etc.
+#
+# If the intrinsics are supported, sets PGAC_ARM_SVE_HEX_INTRINSICS.
+AC_DEFUN([PGAC_ARM_SVE_HEX_INTRINSICS],
+[
+  AC_CACHE_CHECK([for svtbl, svlsr_z, svand_z, svcreate2, svst2, svsel and svget2 intrinsics],
+                 [pgac_cv_arm_sve_hex_intrinsics],
+  [
+
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>],
+    #if defined(__has_attribute) && __has_attribute (target)
+      __attribute__((target("arch=armv8-a+sve")))
+    #endif
+
+    [
+      char input[64] = {0};
+      char output[64] = {0};
+      svbool_t pred = svptrue_b8(), cmp1, cmp2;
+      svuint8_t bytes, hextbl_vec;
+      svuint8x2_t	merged;
+
+      /* intrinsics used in hex_encode_sve */
+      hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+      bytes = svld1(pred, (uint8_t *) input);
+      bytes = svlsr_z(pred, bytes, 4);
+      bytes = svand_z(pred, bytes, 0xF);
+      merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+      svst2(pred, (uint8_t *) output, merged);
+
+      /* intrinsics used in hex_decode_sve */
+      bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+      bytes = svsub_x(pred, bytes, bytes);
+      cmp1 = svcmplt(pred, bytes, 0);
+      cmp2 = svcmpgt(pred, bytes, 0);
+      bytes = svsel(svnot_z(pred, svand_z(pred, cmp1, cmp2)), bytes, bytes);
+      svst1(pred, output, bytes);
+
+      /* return computed value, to prevent the above being optimized away */
+      return output[0] == 0;
+    ])],
+    [pgac_cv_arm_sve_hex_intrinsics=yes],
+    [pgac_cv_arm_sve_hex_intrinsics=no])
+
+  ])
+
+  if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+    PGAC_ARM_SVE_HEX_INTRINSICS=yes
+  fi
+])
diff --git a/configure b/configure
index a0b5e10ca3..7e0c0e4c05 100755
--- a/configure
+++ b/configure
@@ -17159,6 +17159,69 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE intrinsic svtbl, svlsr_z, etc." >&5
+  $as_echo_n "checking for SVE intrinsic svtbl, svlsr_z... " >&6; }
+if ${pgac_cv_arm_sve_hex_intrinsics+:} false; then :
+    $as_echo_n "(cached) " >&6
+else
+    cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute(target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int
+main ()
+{
+    char input[64] = {0};
+    char output[64] = {0};
+    svbool_t pred = svptrue_b8(), cmp1, cmp2;
+    svuint8_t bytes, hextbl_vec;
+    svuint8x2_t	merged;
+
+    /* intrinsics used in hex_encode_sve */
+    hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+    bytes = svld1(pred, (uint8_t *) input);
+    bytes = svlsr_z(pred, bytes, 4);
+    bytes = svand_z(pred, bytes, 0xF);
+    merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+    svst2(pred, (uint8_t *) output, merged);
+
+    /* intrinsics used in hex_decode_sve */
+    bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+    bytes = svsub_x(pred, bytes, bytes);
+    cmp1 = svcmplt(pred, bytes, 0);
+    cmp2 = svcmpgt(pred, bytes, 0);
+    bytes = svsel(svnot_z(pred, svand_z(pred, cmp1, cmp2)), bytes, bytes);
+    svst1(pred, output, bytes);
+
+    /* return computed value, to prevent the above being optimized away */
+    return output[0] == 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_hex_intrinsics=yes
+else
+  pgac_cv_arm_sve_hex_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_hex_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_hex_intrinsics" >&6; }
+
+if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS=yes
+fi
+
+if test x"$PGAC_ARM_SVE_HEX_INTRINSICS" = x"yes"; then
+  $as_echo "#define USE_SVE_WITH_RUNTIME_CHECK 1" >>confdefs.h
+fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index d713360f34..cc805667b9 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS()
+  if test x"$PGAC_ARM_SVE_HEX_INTRINSICS" = x"yes"; then
+    AC_DEFINE(USE_SVE_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM SVE intrinsic for hex coding.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index cfd654d291..a0ee05bad0 100644
--- a/meson.build
+++ b/meson.build
@@ -2194,6 +2194,53 @@ int main(void)
 endif
 
 
+###############################################################
+# Check the availability of ARM SVE intrinsics for hex coding.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    char input[64] = {0};
+    char output[64] = {0};
+    svbool_t pred = svptrue_b8(), cmp1, cmp2;
+    svuint8_t bytes, hextbl_vec;
+    svuint8x2_t	merged;
+
+    /* intrinsics used in hex_encode_sve */
+    hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+    bytes = svld1(pred, (uint8_t *) input);
+    bytes = svlsr_z(pred, bytes, 4);
+    bytes = svand_z(pred, bytes, 0xF);
+    merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+    svst2(pred, (uint8_t *) output, merged);
+
+    /* intrinsics used in hex_decode_sve */
+    bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+    bytes = svsub_x(pred, bytes, bytes);
+    cmp1 = svcmplt(pred, bytes, 0);
+    cmp2 = svcmpgt(pred, bytes, 0);
+    bytes = svsel(svnot_z(pred, svand_z(pred, cmp1, cmp2)), bytes, bytes);
+    svst1(pred, output, bytes);
+
+    /* return computed value, to prevent the above being optimized away */
+    return output[0] == 0;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE hex encoding', args: test_c_args)
+    cdata.set('USE_SVE_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..b4a78cc4e4 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -20,6 +20,10 @@
 #include "utils/memutils.h"
 #include "varatt.h"
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+#include <sys/auxv.h>
+#include <arm_sve.h>
+#endif
 
 /*
  * Encoding conversion API.
@@ -158,8 +162,106 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+static uint64 hex_encode_slow(const char *src, size_t len, char *dst);
+static uint64 hex_decode_slow(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_slow(const char *src, size_t len, char *dst,
+								   Node *escontext);
+static uint64 hex_encode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_sve(const char *src, size_t len, char *dst,
+								  Node *escontext);
+static uint64 hex_encode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_choose(const char *src, size_t len, char *dst,
+									 Node *escontext);
+uint64 (*hex_encode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_encode_choose;
+uint64 (*hex_decode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_decode_choose;
+uint64 (*hex_decode_safe_optimized)
+	   (const char *src, size_t len, char *dst, Node *escontext) =
+		hex_decode_safe_choose;
+
+/*
+ * Returns true if the CPU supports SVE instructions.
+ */
+static inline bool
+check_sve_support(void)
+{
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+}
+
+static inline void
+choose_hex_functions(void)
+{
+	if (check_sve_support())
+	{
+		hex_encode_optimized = hex_encode_sve;
+		hex_decode_optimized = hex_decode_sve;
+		hex_decode_safe_optimized = hex_decode_safe_sve;
+	}
+	else
+	{
+		hex_encode_optimized = hex_encode_slow;
+		hex_decode_optimized = hex_decode_slow;
+		hex_decode_safe_optimized = hex_decode_safe_slow;
+	}
+}
+
+static uint64
+hex_encode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_encode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_decode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext)
+{
+	choose_hex_functions();
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+
 uint64
 hex_encode(const char *src, size_t len, char *dst)
+{
+	if (len < 16)
+		return hex_encode_slow(src, len, dst);
+	return hex_encode_optimized(src, len, dst);
+}
+
+uint64
+hex_decode(const char *src, size_t len, char *dst)
+{
+	if (len < 32)
+		return hex_decode_slow(src, len, dst);
+	return hex_decode_optimized(src, len, dst);
+}
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+	if (len < 32)
+		return hex_decode_safe_slow(src, len, dst, escontext);
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+uint64
+hex_encode_slow(const char *src, size_t len, char *dst)
+#else
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+#endif
 {
 	const char *end = src + len;
 
@@ -186,14 +288,24 @@ get_hex(const char *cp, char *out)
 	return (res >= 0);
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+uint64
+hex_decode_slow(const char *src, size_t len, char *dst)
+#else
 uint64
 hex_decode(const char *src, size_t len, char *dst)
+#endif
 {
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+uint64
+hex_decode_safe_slow(const char *src, size_t len, char *dst, Node *escontext)
+#else
 uint64
 hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+#endif
 {
 	const char *s,
 			   *srcend;
@@ -233,6 +345,116 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+/*
+ * ARM SVE implementation of hex_encode and hex_decode.
+ */
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_encode_sve(const char *src, size_t len, char *dst)
+{
+	const char	hextbl[] = "0123456789abcdef";
+	svbool_t	pred;
+	svuint8_t	bytes,
+				high,
+				low,
+				hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8 *) hextbl);
+	svuint8x2_t	merged;
+	uint32 		vec_len = svcntb();
+
+	for (size_t i = 0; i < len; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, len);
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_z(pred, bytes, 4);		/* high nibble of the byte */
+		low = svand_z(pred, bytes, 0xF);	/* low nibble of the byte */
+
+		/* convert both nibbles to hex digits and interleave them */
+		merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, merged);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	return (uint64) len * 2;
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static inline bool
+get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	svuint8_t	dgt_vec = svsub_x(pred, vec, 48),
+				cap_vec = svsub_x(pred, vec, 55),
+				sml_vec = svsub_x(pred, vec, 87),
+				alpha_vec;
+	svbool_t	dgt_bool = svcmplt(pred, dgt_vec, 10),
+				cap_bool = svcmplt(pred, cap_vec, 16),
+				valid_alpha;
+
+	alpha_vec = svsel(cap_bool, cap_vec, sml_vec);
+	valid_alpha = svand_z(pred, svcmpgt(pred, alpha_vec, 9),
+								svcmplt(pred, alpha_vec, 16));
+
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, dgt_bool, valid_alpha))))
+		return false;	/* invalid hex digit */
+
+	*res = svsel(dgt_bool, dgt_vec, alpha_vec);
+	return true;
+}
+
+uint64
+hex_decode_sve(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_sve(src, len, dst, NULL);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext)
+{
+	svbool_t	pred;
+	svuint8x2_t	bytes;
+	svuint8_t	high,
+				low;
+	uint32		processed;
+	size_t		i = 0,
+				loop_bytes = len & ~31;
+	const char *p = dst;
+
+	while (i < loop_bytes)
+	{
+		pred = svwhilelt_b8(i / 2, len / 2);
+		bytes = svld2(pred, (uint8 *) src);
+		high = svget2(bytes, 0);	/* hex digit for high nibble */
+		low = svget2(bytes, 1);		/* hex digit for low nibble */
+
+		/* fall back if ASCII less than '0' is found */
+		if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'),
+											svcmplt(pred, low, '0'))))
+			break;
+
+		if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+			ereturn(escontext, 0, (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+								   errmsg("invalid hexadecimal digit")));
+
+		/* combine high and low nibble to form the byte and store in dst */
+		svst1(pred, (uint8 *) dst, svorr_x(pred, svlsl_x(pred, high, 4), low));
+
+		processed = svcntp_b8(pred, pred) * 2;
+		src += processed;
+		i += processed;
+		dst += processed / 2;
+	}
+
+	if (i < len)	/* fall back */
+		return dst - p + hex_decode_safe_slow(src, len - i, dst, escontext);
+
+	return dst - p;
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..b5096c11f4 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE instructions for hex coding with a runtime check. */
+#undef USE_SVE_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
-- 
2.34.1

#2Nathan Bossart
nathandbossart@gmail.com
In reply to: Devanga.Susmitha@fujitsu.com (#1)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Thu, Jan 09, 2025 at 11:22:05AM +0000, Devanga.Susmitha@fujitsu.com wrote:

This email aims to discuss the contribution of optimized hex_encode and
hex_decode functions for ARM (aarch64) machines. These functions are
widely used for encoding and decoding binary data in the bytea data type.

Thank you for sharing this work! I'm not able to review this in depth at
the moment, but I am curious if you considered trying to enable
auto-vectorization on the code or using the higher-level SIMD support in
src/include/port/simd.h. Those may not show as impressive of gains as your
patch, but they would likely require much less code and apply to a wider
set of architectures.

--
nathan

#3Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#2)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Hello Nathan,

We tried auto-vectorization and observed no performance improvement.
The instructions in src/include/port/simd.h are based on older SIMD architectures like NEON, whereas the patch uses the newer SVE, so some of the instructions used in the patch may not have direct equivalents in NEON. We will check the feasibility of integrating SVE in "src/include/port/simd.h" and get back to you.
The actual encoding/decoding implementation takes less than 100 lines. The rest of the code is related to config and the "choose" logic. One option is to move the implementation to a new file, making src/backend/utils/adt/encode.c less bloated.

Thanks,
Chiranmoy

#4Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#3)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Jan 10, 2025 at 11:10:03AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

We tried auto-vectorization and observed no performance improvement.

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

The instructions in src/include/port/simd.h are based on older SIMD
architectures like NEON, whereas the patch uses the newer SVE, so some of
the instructions used in the patch may not have direct equivalents in
NEON. We will check the feasibility of integrating SVE in
"src/include/port/simd.h" and get back to you.

Thanks!

--
nathan

#5Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#4)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:

On Fri, Jan 10, 2025 at 11:10:03AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

We tried auto-vectorization and observed no performance improvement.

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

I was able to get auto-vectorization to take effect on Apple clang 16 with
the following addition to src/backend/utils/adt/Makefile:

encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8

This gave the following results with your hex_encode_test() function:

   buf |  HEAD | patch | % diff
-------+-------+-------+-------
    16 |    21 |    16 |     24
    64 |    54 |    41 |     24
   256 |   138 |   100 |     28
  1024 |   441 |   300 |     32
  4096 |  1671 |  1106 |     34
 16384 |  6890 |  4570 |     34
 65536 | 27393 | 18054 |     34

This doesn't compare with the gains you are claiming to see with
intrinsics, but it's not bad for a one line change. I bet there are ways
to adjust the code so that the auto-vectorization is more effective, too.

--
nathan

#6Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#5)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:

Do you mean that the auto-vectorization worked and you observed no
performance improvement, or the auto-vectorization had no effect on the
code generated?

Auto-vectorization is working now with the following addition on Graviton 3 (m7g.4xlarge) with GCC 11.4, and the results match yours. Previously, auto-vectorization had no effect because we missed the -march=native option.

      encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

There is a 30% improvement using auto-vectorization.

   buf | default | auto_vec |  SVE
-------+---------+----------+-----
    16 |      16 |       12 |    8
    64 |      58 |       40 |    9
   256 |     223 |      152 |   18
  1024 |     934 |      613 |   54
  4096 |    3533 |     2430 |  202
 16384 |   14081 |     9831 |  800
 65536 |   56374 |    38702 | 3202

Auto-vectorization had no effect on hex_decode due to the presence of control flow.

-----
Here is a comment snippet from src/include/port/simd.h

"While Neon support is technically optional for aarch64, it appears that all available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machines support NEON, but for newer advanced SIMD extensions like SVE (and AVX-512 for x86) this assumption may not hold. We need a runtime check to be sure. Using src/include/port/simd.h to abstract away these advanced SIMD implementations may be difficult.

We will update the thread once a solution is found.

-----
Chiranmoy

#7Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#6)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Mon, Jan 13, 2025 at 03:48:49PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

There is a 30% improvement using auto-vectorization.

It might be worth enabling auto-vectorization independently of any patches
that use intrinsics, then.

Currently, it is assumed that all aarch64 machine support NEON, but for
newer advanced SIMD like SVE (and AVX512 for x86) this assumption may not
hold. We need a runtime check to be sure.. Using src/include/port/simd.h
to abstract away these advanced SIMD implementations may be difficult.

Yeah, moving simd.h to anything beyond Neon/SSE2 might be tricky at the
moment. Besides the need for additional runtime checks, using wider
registers can mean that you need more data before an optimization takes
effect, which is effectively a regression. I ran into this when I tried to
add AVX2 support to simd.h [0]/messages/by-id/20231129171526.GA857928@nathanxps13. My question about using simd.h was
ultimately about abstracting the relevant Neon/SSE2 instructions and using
those for hex_encode/decode(). If that's possible, I think it'd be
interesting to see how that compares to the SVE version.

[0]: /messages/by-id/20231129171526.GA857928@nathanxps13

--
nathan

#8John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#5)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Sat, Jan 11, 2025 at 3:46 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

I was able to get auto-vectorization to take effect on Apple clang 16 with
the following addition to src/backend/utils/adt/Makefile:

encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -mllvm -force-vector-width=8

This gave the following results with your hex_encode_test() function:

   buf |  HEAD | patch | % diff
-------+-------+-------+-------
    16 |    21 |    16 |     24
    64 |    54 |    41 |     24
   256 |   138 |   100 |     28
  1024 |   441 |   300 |     32
  4096 |  1671 |  1106 |     34
 16384 |  6890 |  4570 |     34
 65536 | 27393 | 18054 |     34

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

select hex_encode_test(1000000, 1024);
master:
Time: 1158.700 ms
v2:
Time: 777.443 ms

If we need to do much better than this, it seems better to send the
data to the client as binary, if possible.

--
John Naylor
Amazon Web Services

Attachments:

v2-byte-lookup.patchtext/x-patch; charset=US-ASCII; name=v2-byte-lookup.patchDownload
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..8b059bc834 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -145,7 +145,7 @@ binary_decode(PG_FUNCTION_ARGS)
  * HEX
  */
 
-static const char hextbl[] = "0123456789abcdef";
+static const char hextbl[512] = "000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -165,9 +165,8 @@ hex_encode(const char *src, size_t len, char *dst)
 
 	while (src < end)
 	{
-		*dst++ = hextbl[(*src >> 4) & 0xF];
-		*dst++ = hextbl[*src & 0xF];
-		src++;
+		memcpy(dst, &hextbl[(* ((unsigned char *) src)) * 2], 2);
+		src++; dst+=2;
 	}
 	return (uint64) len * 2;
 }
#9Michael Paquier
michael@paquier.xyz
In reply to: John Naylor (#8)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Tue, Jan 14, 2025 at 12:27:30PM +0700, John Naylor wrote:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

select hex_encode_test(1000000, 1024);
master:
Time: 1158.700 ms
v2:
Time: 777.443 ms

If we need to do much better than this, it seems better to send the
data to the client as binary, if possible.

That's pretty cool. Complex to parse, still really cool.
--
Michael

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#8)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

John Naylor <johncnaylorls@gmail.com> writes:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion. I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

regards, tom lane

#11Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#10)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:

John Naylor <johncnaylorls@gmail.com> writes:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

Nice. I tried enabling auto-vectorization and loop unrolling on top of
this patch, and the numbers looked the same. I think we'd need CPU
intrinsics or an even bigger lookup table to do any better.

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion. I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

+1

--
nathan

#12John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#11)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Tue, Jan 14, 2025 at 11:57 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Tue, Jan 14, 2025 at 12:59:04AM -0500, Tom Lane wrote:

John Naylor <johncnaylorls@gmail.com> writes:

We can do about as well simply by changing the nibble lookup to a byte
lookup, which works on every compiler and architecture:

Nice. I tried enabling auto-vectorization and loop unrolling on top of
this patch, and the numbers looked the same. I think we'd need CPU
intrinsics or an even bigger lookup table to do any better.

Thanks for looking further! Yeah, I like that the table is still only 512 bytes.

I didn't attempt to verify your patch, but I do prefer addressing
this issue in a machine-independent fashion. I also like the brevity
of the patch (though it could do with some comments perhaps, not that
the existing code has any).

+1

Okay, I added a comment. I also agree with Michael that my quick
one-off was a bit hard to read so I've cleaned it up a bit. I plan to
commit the attached by Friday, along with any bikeshedding that
happens by then.

--
John Naylor
Amazon Web Services

Attachments:

v3-0001-Speed-up-hex_encode-with-bytewise-lookup.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Speed-up-hex_encode-with-bytewise-lookup.patchDownload
From a62aea5fdbfbd215435ddc4c294897caa292b6f7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 15 Jan 2025 13:28:26 +0700
Subject: [PATCH v3] Speed up hex_encode with bytewise lookup

Previously, hex_encode looked up each nibble of the input
separately. We now use a larger lookup table containing the two-byte
encoding of every possible input byte, resulting in a 1/3 reduction
in encoding time.

Reviewed by Michael Paquier, Tom Lane, and Nathan Bossart

Discussion: https://postgr.es/m/CANWCAZZvXuJMgqMN4u068Yqa19CEjS31tQKZp_qFFFbgYfaXqQ%40mail.gmail.com
---
 src/backend/utils/adt/encode.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4a6fcb56cd..7fee154b0d 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -145,7 +145,23 @@ binary_decode(PG_FUNCTION_ARGS)
  * HEX
  */
 
-static const char hextbl[] = "0123456789abcdef";
+static const char hextbl[512] =
+"000102030405060708090a0b0c0d0e0f"
+"101112131415161718191a1b1c1d1e1f"
+"202122232425262728292a2b2c2d2e2f"
+"303132333435363738393a3b3c3d3e3f"
+"404142434445464748494a4b4c4d4e4f"
+"505152535455565758595a5b5c5d5e5f"
+"606162636465666768696a6b6c6d6e6f"
+"707172737475767778797a7b7c7d7e7f"
+"808182838485868788898a8b8c8d8e8f"
+"909192939495969798999a9b9c9d9e9f"
+"a0a1a2a3a4a5a6a7a8a9aaabacadaeaf"
+"b0b1b2b3b4b5b6b7b8b9babbbcbdbebf"
+"c0c1c2c3c4c5c6c7c8c9cacbcccdcecf"
+"d0d1d2d3d4d5d6d7d8d9dadbdcdddedf"
+"e0e1e2e3e4e5e6e7e8e9eaebecedeeef"
+"f0f1f2f3f4f5f6f7f8f9fafbfcfdfeff";
 
 static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -165,9 +181,16 @@ hex_encode(const char *src, size_t len, char *dst)
 
 	while (src < end)
 	{
-		*dst++ = hextbl[(*src >> 4) & 0xF];
-		*dst++ = hextbl[*src & 0xF];
+		unsigned char usrc = *((unsigned char *) src);
+
+		/*
+		 * Each input byte results in two output bytes, so we use the unsigned
+		 * input byte multiplied by two as the lookup key.
+		 */
+		memcpy(dst, &hextbl[2 * usrc], 2);
+
 		src++;
+		dst += 2;
 	}
 	return (uint64) len * 2;
 }
-- 
2.47.1

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#12)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

John Naylor <johncnaylorls@gmail.com> writes:

Okay, I added a comment. I also agree with Michael that my quick
one-off was a bit hard to read so I've cleaned it up a bit. I plan to
commit the attached by Friday, along with any bikeshedding that
happens by then.

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
* The hex expansion of each possible byte value (two chars per value).
*/

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+ unsigned char usrc = *((unsigned char *) src);

results in casting away const. Recommend

+ unsigned char usrc = *((const unsigned char *) src);

3. I really wonder if

+ memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+ *dst++ = hextbl[2 * usrc];
+ *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

A variant could be

+		const char *hexptr = &hextbl[2 * usrc];
+		*dst++ = hexptr[0];
+		*dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

regards, tom lane

#14John Naylor
johncnaylorls@gmail.com
In reply to: Tom Lane (#13)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Jan 15, 2025 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
* The hex expansion of each possible byte value (two chars per value).
*/

Works for me. With that, did you mean we then wouldn't need a comment
in the code?

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+ unsigned char usrc = *((unsigned char *) src);

results in casting away const. Recommend

+ unsigned char usrc = *((const unsigned char *) src);

Thanks for the reminder!

3. I really wonder if

+ memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+               *dst++ = hextbl[2 * usrc];
+               *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

A variant could be

+               const char *hexptr = &hextbl[2 * usrc];
+               *dst++ = hexptr[0];
+               *dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

This combines to a single word starting with clang 5, but does not
work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
on my machine bytewise load/stores are somewhere in the middle:

master 1158.969 ms
v3 776.791 ms
variant 4A 775.777 ms
variant 4B 969.945 ms

https://godbolt.org/z/ajToordKq

--
John Naylor
Amazon Web Services

#15Ranier Vilela
ranier.vf@gmail.com
In reply to: John Naylor (#14)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Hi.

Em qua., 15 de jan. de 2025 às 07:57, John Naylor <johncnaylorls@gmail.com>
escreveu:

On Wed, Jan 15, 2025 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Couple of thoughts:

1. I was actually hoping for a comment on the constant's definition,
perhaps along the lines of

/*
* The hex expansion of each possible byte value (two chars per value).
*/

Works for me. With that, did you mean we then wouldn't need a comment
in the code?

2. Since "src" is defined as "const char *", I'm pretty sure that
pickier compilers will complain that

+ unsigned char usrc = *((unsigned char *) src);

results in casting away const. Recommend

+ unsigned char usrc = *((const unsigned char *) src);

Thanks for the reminder!

3. I really wonder if

+ memcpy(dst, &hextbl[2 * usrc], 2);

is faster than copying the two bytes manually, along the lines of

+               *dst++ = hextbl[2 * usrc];
+               *dst++ = hextbl[2 * usrc + 1];

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

A variant could be

+               const char *hexptr = &hextbl[2 * usrc];
+               *dst++ = hexptr[0];
+               *dst++ = hexptr[1];

but this supposes that the compiler fails to see the common
subexpression in the other formulation, which I believe
most modern compilers will see.

This combines to a single word starting with clang 5, but does not
work on gcc 14.2 or gcc trunk (4B below). I have gcc 14.2 handy, and
on my machine bytewise load/stores are somewhere in the middle:

master 1158.969 ms
v3 776.791 ms
variant 4A 775.777 ms
variant 4B 969.945 ms

https://godbolt.org/z/ajToordKq

Your example from godbolt has an important difference, which modifies the
assembler result.

-static const char hextbl[] =
"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"
;
+static const char hextbl[512] =
"000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff"
;

best regards,
Ranier Vilela

#16David Rowley
dgrowleyml@gmail.com
In reply to: John Naylor (#14)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, 15 Jan 2025 at 23:57, John Naylor <johncnaylorls@gmail.com> wrote:

On Wed, Jan 15, 2025 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Compilers that inline memcpy() may arrive at the same machine code,
but why rely on the compiler to make that optimization? If the
compiler fails to do so, an out-of-line memcpy() call will surely
be a loser.

See measurements at the end. As for compilers, gcc 3.4.6 and clang
3.0.0 can inline the memcpy. The manual copy above only gets combined
to a single word starting with gcc 12 and clang 15, and latest MSVC
still can't do it (4A in the godbolt link below). Are there any
buildfarm animals around that may not inline memcpy for word-sized
input?

A variant could be

+               const char *hexptr = &hextbl[2 * usrc];
+               *dst++ = hexptr[0];
+               *dst++ = hexptr[1];

I'd personally much rather see us using memcpy() for this sort of
stuff. If the compiler is too braindead to inline tiny
constant-and-power-of-two-sized memcpys then we'd probably also have
plenty of other performance issues with that compiler already. I don't
think contorting the code into something less human-readable and
something the compiler may struggle even more to optimise is a good
idea. The naive way to implement the above requires two MOVs of
single bytes and two increments of dst. I imagine it's easier for the
compiler to inline a small constant-sized memcpy() than to figure out
that it's safe to implement the above with a single word-sized MOV
rather than two byte-sized MOVs due to the "dst++" in between the two.

I agree that the evidence you (John) gathered is enough reason to use memcpy().

David

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#16)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

David Rowley <dgrowleyml@gmail.com> writes:

I agree that the evidence you (John) gathered is enough reason to use memcpy().

Okay ... doesn't quite match my intuition, but intuition is a poor
guide to such things.

regards, tom lane

#18Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#17)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

With commit e24d770 in place, I took a closer look at hex_decode(), and I
concluded that doing anything better without intrinsics would likely
require either a huge lookup table or something with complexity rivalling
the intrinsics approach (while also not rivalling its performance). So, I
took a closer look at the intrinsics patches and had the following
thoughts:

* The approach looks generally reasonable to me, but IMHO the code needs
much more commentary to explain how it works.

* The functions that test the length before potentially calling a function
pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
I wouldn't be surprised if some compilers are inlining this stuff
already, but it's probably worth being explicit about it.

* Finally, I think we should ensure we've established a really strong case
for this optimization. IME these intrinsics patches require a ton of
time and energy, and the code is often extremely complex. I would be
interested to see how your bytea test compares with the improvements
added in commit e24d770 and with sending the data in binary.

--
nathan

#19Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#18)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

The approach looks generally reasonable to me, but IMHO the code needs
much more commentary to explain how it works.

Added comments to explain the SVE implementation.

I would be interested to see how your bytea test compares with the
improvements added in commit e24d770 and with sending the data in binary.

The following are the bytea test results with commit e24d770.
The same query and tables were used.

With commit e24d770:
Query exec time: 2.324 sec
hex_encode function time: 0.72 sec

Pre-commit e24d770:
Query exec time: 2.858 sec
hex_encode function time: 1.228 sec

SVE patch:
Query exec time: 1.654 sec
hex_encode_sve function time: 0.085 sec

The functions that test the length before potentially calling a function
pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
I wouldn't be surprised if some compilers are inlining this stuff
already, but it's probably worth being explicit about it.

Should we implement an inline function in "utils/builtins.h", similar to
pg_popcount()? Currently, we have not modified the header file, everything
is statically implemented in encode.c.

---
Chiranmoy

#20Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#19)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

I realized I didn't attach the patch.

Attachments:

v2-0001-SVE-support-for-hex-encode-and-hex-decode.patchapplication/octet-stream; name=v2-0001-SVE-support-for-hex-encode-and-hex-decode.patchDownload
From 2094bc7f60db93693f2c054e9044d8baa128bb8f Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Wed, 22 Jan 2025 15:52:40 +0530
Subject: [PATCH v2] SVE support for hex encode and hex decode

---
 config/c-compiler.m4           |  53 ++++++++
 configure                      |  63 +++++++++
 configure.ac                   |   9 ++
 meson.build                    |  47 +++++++
 src/backend/utils/adt/encode.c | 241 +++++++++++++++++++++++++++++++++
 src/include/pg_config.h.in     |   3 +
 6 files changed, 416 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..bb22ceed17 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,56 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_HEX_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE intrinsic required for hex coding:
+# svld1, svtbl, svsel, etc.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_hex_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_HEX_INTRINSICS],
+[
+  AC_CACHE_CHECK([for svtbl, svlsr_z, svand_z, svcreate2, svst2, svsel and svget2 intrinsics],
+                 [pgac_cv_arm_sve_hex_intrinsics],
+  [
+
+    AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>],
+    #if defined(__has_attribute) && __has_attribute (target)
+      __attribute__((target("arch=armv8-a+sve")))
+    #endif
+
+    [
+      char input[64] = {0};
+      char output[64] = {0};
+      svbool_t pred = svptrue_b8(), cmp1, cmp2;
+      svuint8_t bytes, hextbl_vec;
+      svuint8x2_t	merged;
+
+      /* intrinsics used in hex_encode_sve */
+      hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+      bytes = svld1(pred, (uint8_t *) input);
+      bytes = svlsr_z(pred, bytes, 4);
+      bytes = svand_z(pred, bytes, 0xF);
+      merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+      svst2(pred, (uint8_t *) output, merged);
+
+      /* intrinsics used in hex_decode_sve */
+      bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+      bytes = svsub_x(pred, bytes, bytes);
+      cmp1 = svcmplt(pred, bytes, 0);
+      cmp2 = svcmpgt(pred, bytes, 0);
+      bytes = svsel(svnot_z(pred, svand_z(pred, cmp1, cmp2)), bytes, bytes);
+      svst1(pred, output, bytes);
+
+      /* return computed value, to prevent the above being optimized away */
+      return output[0] == 0;
+    ])],
+    [pgac_cv_arm_sve_hex_intrinsics=yes],
+    [pgac_cv_arm_sve_hex_intrinsics=no])
+
+  ])
+
+  if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+    pgac_arm_sve_hex_intrinsics=yes
+  fi
+])
diff --git a/configure b/configure
index ceeef9b091..e634feec02 100755
--- a/configure
+++ b/configure
@@ -17168,6 +17168,69 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE intrinsic svtbl, svlsr_z, etc." >&5
+  $as_echo_n "checking for SVE intrinsic svtbl, svlsr_z... " >&6; }
+if ${pgac_cv_arm_sve_hex_intrinsics+:} false; then :
+    $as_echo_n "(cached) " >&6
+else
+    cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute(target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int
+main ()
+{
+    char input[64] = {0};
+    char output[64] = {0};
+    svbool_t pred = svptrue_b8(), cmp1, cmp2;
+    svuint8_t bytes, hextbl_vec;
+    svuint8x2_t	merged;
+
+    /* intrinsics used in hex_encode_sve */
+    hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+    bytes = svld1(pred, (uint8_t *) input);
+    bytes = svlsr_z(pred, bytes, 4);
+    bytes = svand_z(pred, bytes, 0xF);
+    merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+    svst2(pred, (uint8_t *) output, merged);
+
+    /* intrinsics used in hex_decode_sve */
+    bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+    bytes = svsub_x(pred, bytes, bytes);
+    cmp1 = svcmplt(pred, bytes, 0);
+    cmp2 = svcmpgt(pred, bytes, 0);
+    bytes = svsel(svnot_z(pred, svand_z(pred, cmp1, cmp2)), bytes, bytes);
+    svst1(pred, output, bytes);
+
+    /* return computed value, to prevent the above being optimized away */
+    return output[0] == 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_hex_intrinsics=yes
+else
+  pgac_cv_arm_sve_hex_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_hex_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_hex_intrinsics" >&6; }
+
+if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS=yes
+fi
+
+if test x"$PGAC_ARM_SVE_HEX_INTRINSICS" = x"yes"; then
+  $as_echo "#define USE_SVE_WITH_RUNTIME_CHECK 1" >>confdefs.h
+fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index d713360f34..cc805667b9 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS()
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM SVE intrinsic for hex coding.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 32fc89f3a4..d9d13b3c55 100644
--- a/meson.build
+++ b/meson.build
@@ -2194,6 +2194,53 @@ int main(void)
 endif
 
 
+###############################################################
+# Check the availability of ARM SVE intrinsics for hex coding.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    char input[64] = {0};
+    char output[64] = {0};
+    svbool_t pred = svptrue_b8(), cmp1, cmp2;
+    svuint8_t bytes, hextbl_vec;
+    svuint8x2_t	merged;
+
+    /* intrinsics used in hex_encode_sve */
+    hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+    bytes = svld1(pred, (uint8_t *) input);
+    bytes = svlsr_z(pred, bytes, 4);
+    bytes = svand_z(pred, bytes, 0xF);
+    merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+    svst2(pred, (uint8_t *) output, merged);
+
+    /* intrinsics used in hex_decode_sve */
+    bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+    bytes = svsub_x(pred, bytes, bytes);
+    cmp1 = svcmplt(pred, bytes, 0);
+    cmp2 = svcmpgt(pred, bytes, 0);
+    bytes = svsel(svnot_z(pred, svand_z(pred, cmp1, cmp2)), bytes, bytes);
+    svst1(pred, output, bytes);
+
+    /* return computed value, to prevent the above being optimized away */
+    return output[0] == 0;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE hex encoding', args: test_c_args)
+    cdata.set('USE_SVE_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d..0fe41a8d00 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -20,6 +20,10 @@
 #include "utils/memutils.h"
 #include "varatt.h"
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+#include <sys/auxv.h>
+#include <arm_sve.h>
+#endif
 
 /*
  * Encoding conversion API.
@@ -177,8 +181,106 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+static uint64 hex_encode_slow(const char *src, size_t len, char *dst);
+static uint64 hex_decode_slow(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_slow(const char *src, size_t len, char *dst,
+								   Node *escontext);
+static uint64 hex_encode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_sve(const char *src, size_t len, char *dst,
+								  Node *escontext);
+static uint64 hex_encode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_choose(const char *src, size_t len, char *dst,
+									 Node *escontext);
+uint64 (*hex_encode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_encode_choose;
+uint64 (*hex_decode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_decode_choose;
+uint64 (*hex_decode_safe_optimized)
+	   (const char *src, size_t len, char *dst, Node *escontext) =
+		hex_decode_safe_choose;
+
+/*
+ * Returns true if the CPU supports SVE instructions.
+ */
+static inline bool
+check_sve_support(void)
+{
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+}
+
+static inline void
+choose_hex_functions(void)
+{
+	if (check_sve_support())
+	{
+		hex_encode_optimized = hex_encode_sve;
+		hex_decode_optimized = hex_decode_sve;
+		hex_decode_safe_optimized = hex_decode_safe_sve;
+	}
+	else
+	{
+		hex_encode_optimized = hex_encode_slow;
+		hex_decode_optimized = hex_decode_slow;
+		hex_decode_safe_optimized = hex_decode_safe_slow;
+	}
+}
+
+static uint64
+hex_encode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_encode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_decode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext)
+{
+	choose_hex_functions();
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+	if (len < 16)
+		return hex_encode_slow(src, len, dst);
+	return hex_encode_optimized(src, len, dst);
+}
+
+uint64
+hex_decode(const char *src, size_t len, char *dst)
+{
+	if (len < 32)
+		return hex_decode_slow(src, len, dst);
+	return hex_decode_optimized(src, len, dst);
+}
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+	if (len < 32)
+		return hex_decode_safe_slow(src, len, dst, escontext);
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+uint64
+hex_encode_slow(const char *src, size_t len, char *dst)
+#else
 uint64
 hex_encode(const char *src, size_t len, char *dst)
+#endif
 {
 	const char *end = src + len;
 
@@ -207,14 +309,24 @@ get_hex(const char *cp, char *out)
 	return (res >= 0);
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+uint64
+hex_decode_slow(const char *src, size_t len, char *dst)
+#else
 uint64
 hex_decode(const char *src, size_t len, char *dst)
+#endif
 {
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+uint64
+hex_decode_safe_slow(const char *src, size_t len, char *dst, Node *escontext)
+#else
 uint64
 hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+#endif
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +366,135 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+/*
+ * SVE implementation of hex_encode and hex_decode.
+ */
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_encode_sve(const char *src, size_t len, char *dst)
+{
+	const char	hextbl[] = "0123456789abcdef";
+	svbool_t	pred;
+	svuint8_t	bytes,
+				high,
+				low,
+				hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8 *) hextbl);
+	svuint8x2_t	merged;
+	uint32 		vec_len = svcntb();
+
+	for (size_t i = 0; i < len; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, len);
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_z(pred, bytes, 4);	/* shift-right to get the high nibble */
+		low = svand_z(pred, bytes, 0xF);   /* mask high to get the low nibble */
+
+		/*
+		 * Convert the nibbles to hex digits by indexing into hextbl_vec,
+		 * for example, a nibble value of 10 indexed into hextbl_vec gives 'a'.
+		 * Finally, interleave the high and low nibbles
+		 */
+		merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, merged);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	return (uint64) len * 2;
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static inline bool
+get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	/*
+	 * Convert ASCII values '0'-'9' to integers 0-9 by subtracting 48.
+	 * Similarly, convert letters 'A'-'F' and 'a'-'f' to integers 10-15.
+	 */
+	svuint8_t	dgt_vec = svsub_x(pred, vec, 48),
+				cap_vec = svsub_x(pred, vec, 55),
+				sml_vec = svsub_x(pred, vec, 87),
+				letter_vec;
+	/*
+	 * Identify valid integers in dgt_vec, cap_vec, and sml_vec.
+	 * Values 0-9 are valid in dgt_vec, while values 10-15 are valid
+	 * in cap_vec and sml_vec.
+	 */
+	svbool_t	dgt_bool = svcmplt(pred, dgt_vec, 10),
+				cap_bool = svcmplt(pred, cap_vec, 16),
+				letter_bool;
+	/*
+	 * Combine cap_vec and sml_vec and mark the valid range 10-15.
+	 */
+	letter_vec = svsel(cap_bool, cap_vec, sml_vec);
+	letter_bool = svand_z(pred, svcmpgt(pred, letter_vec, 9),
+								svcmplt(pred, letter_vec, 16));
+	/*
+	 * Check for invalid hexadecimal digits. Each value must fall
+	 * within the range 0-9 (true in dgt_bool) or 10-15 (true in letter_bool).
+	 */
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, dgt_bool, letter_bool))))
+		return false;
+
+	/* Finally, combine dgt_vec and letter_vec */
+	*res = svsel(dgt_bool, dgt_vec, letter_vec);
+	return true;
+}
+
+uint64
+hex_decode_sve(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_sve(src, len, dst, NULL);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext)
+{
+	svbool_t	pred;
+	svuint8x2_t	bytes;
+	svuint8_t	high,
+				low;
+	uint32		processed;
+	size_t		i = 0,
+				loop_bytes = len & ~1;	/* handles inputs of odd length */
+	const char *p = dst;
+
+	while (i < loop_bytes)
+	{
+		pred = svwhilelt_b8(i / 2, len / 2);
+		bytes = svld2(pred, (uint8 *) src);
+		high = svget2(bytes, 0);	/* hex digit for high nibble */
+		low = svget2(bytes, 1);		/* hex digit for low nibble */
+
+		/* fall back if ASCII less than '0' is found */
+		if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'),
+											svcmplt(pred, low, '0'))))
+			break;
+
+		/* fall back if invalid hexadecimal digit is found */
+		if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+			break;
+
+		/* left-shift high and perform bitwise OR with low to form the byte */
+		svst1(pred, (uint8 *) dst, svorr_x(pred, svlsl_x(pred, high, 4), low));
+
+		processed = svcntp_b8(pred, pred) * 2;
+		src += processed;
+		i += processed;
+		dst += processed / 2;
+	}
+
+	if (i < len)	/* fall back */
+		return dst - p + hex_decode_safe_slow(src, len - i, dst, escontext);
+
+	return dst - p;
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..b5096c11f4 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE instructions for hex coding with a runtime check. */
+#undef USE_SVE_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
-- 
2.34.1

#21 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#19)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Jan 22, 2025 at 10:58:09AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
>> The functions that test the length before potentially calling a function
>> pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
>> I wouldn't be surprised if some compilers are inlining this stuff
>> already, but it's probably worth being explicit about it.
>
> Should we implement an inline function in "utils/builtins.h", similar to
> pg_popcount()? Currently, we have not modified the header file, everything
> is statically implemented in encode.c.

Yeah, that's what I'm currently thinking we should do.

--
nathan

#22 Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#20)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Jan 22, 2025 at 11:10:10AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:
> I realized I didn't attach the patch.

Thanks. Would you mind creating a commitfest entry for this one?

--
nathan

#23 Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#22)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Inlined the hex encode/decode functions in "src/include/utils/builtins.h"
similar to pg_popcount() in pg_bitutils.h.

---
Chiranmoy

Attachments:

v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch (application/octet-stream)
From 015afdfa5b1eccc039bc1c276dd7a51d3729257a Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Tue, 4 Feb 2025 11:26:41 +0530
Subject: [PATCH v3] SVE support for hex encode and hex decode

---
 config/c-compiler.m4           |  58 +++++++++
 configure                      |  79 ++++++++++++
 configure.ac                   |   9 ++
 meson.build                    |  56 +++++++++
 src/backend/utils/adt/encode.c | 212 ++++++++++++++++++++++++++++++++-
 src/include/pg_config.h.in     |   3 +
 src/include/utils/builtins.h   |  55 ++++++++-
 7 files changed, 466 insertions(+), 6 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..d99ecfb2a7 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_HEX_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE intrinsics required for hex coding:
+# svtbl, svlsr_x, svand_z, svcreate2, etc.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_hex_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_HEX_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_hex_intrinsics])])dnl
+AC_CACHE_CHECK([for svtbl, svlsr_x, svand_z, svcreate2, etc], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int vec_len = svcntb();
+      char input@<:@32@:>@;
+      char output@<:@32@:>@;
+      svbool_t pred = svptrue_b8(), cmp1, cmp2;
+      svuint8_t bytes, hextbl_vec;
+      svuint8x2_t	merged;
+
+      if (vec_len >= 16)
+      {
+        /* intrinsics used in hex_encode_sve */
+        hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+        bytes = svld1(pred, (uint8_t *) input);
+        bytes = svlsr_x(pred, bytes, 4);
+        bytes = svand_x(pred, bytes, 0xF);
+        merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+        svst2(pred, (uint8_t *) output, merged);
+
+        /* intrinsics used in hex_decode_sve */
+        bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+        bytes = svsub_x(pred, bytes, 48);
+        cmp1 = svcmplt(pred, bytes, 16);
+        cmp2 = svcmpgt(pred, bytes, 9);
+        if (svptest_any(pred, svnot_z(pred, svorr_z(pred, cmp1, cmp2))))
+          return 0;
+        bytes = svsel(svand_z(pred, cmp1, cmp2), bytes, bytes);
+        bytes = svlsl_x(pred, bytes, svcntp_b8(pred, pred));
+        svst1(pred, output, bytes);
+
+        /* return computed value, to prevent the above being optimized away */
+        return output@<:@0@:>@ == 0;
+      }
+
+      return 0;
+    }],
+  [return hex_coding_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_HEX_INTRINSICS
diff --git a/configure b/configure
index ceeef9b091..e445cb1451 100755
--- a/configure
+++ b/configure
@@ -17168,6 +17168,85 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svtbl, svlsr_z, svand_z, svcreate2, etc" >&5
+$as_echo_n "checking for svtbl, svlsr_z, svand_z, svcreate2, etc... " >&6; }
+if ${pgac_cv_arm_sve_hex_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int vec_len = svcntb();
+      char input[32];
+      char output[32];
+      svbool_t pred = svptrue_b8(), cmp1, cmp2;
+      svuint8_t bytes, hextbl_vec;
+      svuint8x2_t	merged;
+
+      if (vec_len >= 16)
+      {
+        /* intrinsics used in hex_encode_sve */
+        hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+        bytes = svld1(pred, (uint8_t *) input);
+        bytes = svlsr_x(pred, bytes, 4);
+        bytes = svand_x(pred, bytes, 0xF);
+        merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+        svst2(pred, (uint8_t *) output, merged);
+
+        /* intrinsics used in hex_decode_sve */
+        bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+        bytes = svsub_x(pred, bytes, 48);
+        cmp1 = svcmplt(pred, bytes, 16);
+        cmp2 = svcmpgt(pred, bytes, 9);
+        if (svptest_any(pred, svnot_z(pred, svorr_z(pred, cmp1, cmp2))))
+          return 0;
+        bytes = svsel(svand_z(pred, cmp1, cmp2), bytes, bytes);
+        bytes = svlsl_x(pred, bytes, svcntp_b8(pred, pred));
+        svst1(pred, output, bytes);
+
+        /* return computed value, to prevent the above being optimized away */
+        return output[0] == 0;
+      }
+
+      return 0;
+    }
+int
+main ()
+{
+return hex_coding_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_hex_intrinsics=yes
+else
+  pgac_cv_arm_sve_hex_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_hex_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_hex_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index d713360f34..2dbb678cae 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS()
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM SVE intrinsic for hex coding.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 8e128f4982..6a10331acf 100644
--- a/meson.build
+++ b/meson.build
@@ -2194,6 +2194,62 @@ int main(void)
 endif
 
 
+###############################################################
+# Check the availability of ARM SVE intrinsics for hex coding.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    int vec_len = svcntb();
+    char input[64] = {0};
+    char output[64] = {0};
+    svbool_t pred = svptrue_b8(), cmp1, cmp2;
+    svuint8_t bytes, hextbl_vec;
+    svuint8x2_t	merged;
+
+    if (vec_len >= 16)
+    {
+      /* intrinsics used in hex_encode_sve */
+      hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+      bytes = svld1(pred, (uint8_t *) input);
+      bytes = svlsr_x(pred, bytes, 4);
+      bytes = svand_x(pred, bytes, 0xF);
+      merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+      svst2(pred, (uint8_t *) output, merged);
+
+      /* intrinsics used in hex_decode_sve */
+      bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+      bytes = svsub_x(pred, bytes, 48);
+      cmp1 = svcmplt(pred, bytes, 16);
+      cmp2 = svcmpgt(pred, bytes, 9);
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, cmp1, cmp2))))
+        return 0;
+      bytes = svsel(svand_z(pred, cmp1, cmp2), bytes, bytes);
+      bytes = svlsl_x(pred, bytes, svcntp_b8(pred, pred));
+      svst1(pred, output, bytes);
+
+      /* return computed value, to prevent the above being optimized away */
+      return output[0] == 0;
+    }
+
+    return 0;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE hex coding', args: test_c_args)
+    cdata.set('USE_SVE_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d..cf0137a1f1 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -20,6 +20,12 @@
 #include "utils/memutils.h"
 #include "varatt.h"
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+#endif
 
 /*
  * Encoding conversion API.
@@ -177,8 +183,81 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+static uint64 hex_encode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_sve(const char *src, size_t len, char *dst,
+								  Node *escontext);
+static uint64 hex_encode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_choose(const char *src, size_t len, char *dst,
+									 Node *escontext);
+uint64 (*hex_encode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_encode_choose;
+uint64 (*hex_decode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_decode_choose;
+uint64 (*hex_decode_safe_optimized)
+	   (const char *src, size_t len, char *dst, Node *escontext) =
+		hex_decode_safe_choose;
+
+/*
+ * Returns true if the CPU supports SVE instructions.
+ */
+static inline bool
+check_sve_support(void)
+{
+#if defined(HAVE_ELF_AUX_INFO) && defined(__aarch64__)  /* FreeBSD */
+	unsigned long value;
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL) && defined(__aarch64__)   /* Linux */
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_hex_functions(void)
+{
+	if (check_sve_support())
+	{
+		hex_encode_optimized = hex_encode_sve;
+		hex_decode_optimized = hex_decode_sve;
+		hex_decode_safe_optimized = hex_decode_safe_sve;
+	}
+	else
+	{
+		hex_encode_optimized = hex_encode_scalar;
+		hex_decode_optimized = hex_decode_scalar;
+		hex_decode_safe_optimized = hex_decode_safe_scalar;
+	}
+}
+
+static uint64
+hex_encode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_encode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_decode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext)
+{
+	choose_hex_functions();
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
 uint64
-hex_encode(const char *src, size_t len, char *dst)
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -208,13 +287,13 @@ get_hex(const char *cp, char *out)
 }
 
 uint64
-hex_decode(const char *src, size_t len, char *dst)
+hex_decode_scalar(const char *src, size_t len, char *dst)
 {
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
 uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +333,133 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+/*
+ * SVE implementation of hex_encode and hex_decode.
+ */
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_encode_sve(const char *src, size_t len, char *dst)
+{
+	const char	hextbl[] = "0123456789abcdef";
+	svbool_t	pred;
+	svuint8_t	bytes,
+				high,
+				low,
+				hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8 *) hextbl);
+	svuint8x2_t	merged;
+	uint32 		vec_len = svcntb();
+
+	for (size_t i = 0; i < len; i += vec_len)
+	{
+		pred = svwhilelt_b8(i, len);
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_x(pred, bytes, 4);	/* shift-right to get the high nibble */
+		low = svand_z(pred, bytes, 0xF);   /* mask high to get the low nibble */
+
+		/*
+		 * Convert the nibbles to hex digits by indexing into hextbl_vec,
+		 * for example, a nibble value of 10 indexed into hextbl_vec gives 'a'.
+		 * Finally, interleave the high and low nibbles.
+		 */
+		merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, merged);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	return (uint64) len * 2;
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static inline bool
+get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	/*
+	 * Convert ASCII values '0'-'9' to integers 0-9 by subtracting 48.
+	 * Similarly, convert letters 'A'-'F' and 'a'-'f' to integers 10-15.
+	 */
+	svuint8_t	dgt_vec = svsub_x(pred, vec, 48),
+				cap_vec = svsub_x(pred, vec, 55),
+				sml_vec = svsub_x(pred, vec, 87),
+				ltr_vec;
+	/*
+	 * Identify valid integers in dgt_vec, cap_vec, and sml_vec.
+	 * Integers 0-9 are valid in dgt_vec, while integers 10-15 are valid
+	 * in cap_vec and sml_vec.
+	 */
+	svbool_t	valid_dgt = svcmplt(pred, dgt_vec, 10),
+				valid_ltr;
+
+	/* Combine cap_vec and sml_vec and mark the valid range 10-15. */
+	ltr_vec = svsel(svcmplt(pred, cap_vec, 16), cap_vec, sml_vec);
+	valid_ltr = svand_z(pred, svcmpgt(pred, ltr_vec, 9),
+							  svcmplt(pred, ltr_vec, 16));
+	/*
+	 * Check for invalid hexadecimal digits. Each value must fall
+	 * within the range 0-9 (true in valid_dgt) or 10-15 (true in valid_ltr).
+	 */
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_dgt, valid_ltr))))
+		return false;
+
+	/* Finally, combine dgt_vec and ltr_vec */
+	*res = svsel(valid_dgt, dgt_vec, ltr_vec);
+	return true;
+}
+
+uint64
+hex_decode_sve(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_sve(src, len, dst, NULL);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext)
+{
+	svbool_t	pred;
+	svuint8x2_t	bytes;
+	svuint8_t	high,
+				low;
+	uint32		processed;
+	size_t		i = 0,
+				loop_bytes = len & ~1;	/* handles inputs of odd length */
+	const char *p = dst;
+
+	while (i < loop_bytes)
+	{
+		pred = svwhilelt_b8(i / 2, len / 2);
+		bytes = svld2(pred, (uint8 *) src);
+		high = svget2(bytes, 0);	/* hex digits for high nibble */
+		low = svget2(bytes, 1);		/* hex digits for low nibble */
+
+		/* fallback if a character below ASCII '0' is found. */
+		if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'),
+											svcmplt(pred, low, '0'))))
+			break;
+
+		/* fallback if invalid hexadecimal digit is found */
+		if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+			break;
+
+		/* left-shift high and perform bitwise OR with low to form the byte */
+		svst1(pred, (uint8 *) dst, svorr_z(pred, svlsl_x(pred, high, 4), low));
+
+		processed = svcntp_b8(pred, pred) * 2;
+		src += processed;
+		i += processed;
+		dst += processed / 2;
+	}
+
+	if (i < len)	/* fall back */
+		return dst - p + hex_decode_safe_scalar(src, len - i, dst, escontext);
+
+	return dst - p;
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..b5096c11f4 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE instructions for hex coding with a runtime check. */
+#undef USE_SVE_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1c98c7d225..e9b1f963dd 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -35,11 +35,60 @@ extern int	errdatatype(Oid datatypeOid);
 extern int	errdomainconstraint(Oid datatypeOid, const char *conname);
 
 /* encode.c */
-extern uint64 hex_encode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode_safe(const char *src, size_t len, char *dst,
+extern uint64 hex_encode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_safe_scalar(const char *src, size_t len, char *dst,
 							  Node *escontext);
 
+/*
+ * We can use SVE intrinsics for hex-coding, but only if we can
+ * verify that the CPU supports it via a runtime check.
+ */
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*hex_encode_optimized)
+	   (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_optimized)
+	   (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_safe_optimized)
+	   (const char *src, size_t len, char *dst, Node *escontext);
+#endif		/* USE_SVE_WITH_RUNTIME_CHECK */
+
+static inline uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+	int	threshold = 16;
+
+	if (len >= threshold)
+		return hex_encode_optimized(src, len, dst);
+#endif
+	return hex_encode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_optimized(src, len, dst);
+#endif
+	return hex_decode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_safe_optimized(src, len, dst, escontext);
+#endif
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+}
+
 /* int.c */
 extern int2vector *buildint2vector(const int16 *int2s, int n);
 
-- 
2.34.1

#24Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#23)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

It seems the patch doesn't compile on macOS: the compiler cannot resolve 'i'
and 'len', which are of type 'size_t', against the 'uint64' overload of
'svwhilelt_b8'. This appears to be a macOS-specific issue. The latest patch
should resolve it by casting 'size_t' to 'uint64' before passing the values
to 'svwhilelt_b8'.
[11:04:07.478] ../src/backend/utils/adt/encode.c:356:10: error: call to 'svwhilelt_b8' is ambiguous
[11:04:07.478] 356 | pred = svwhilelt_b8(i, len);
[11:04:07.478] | ^~~~~~~~~~~~
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10: note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10: note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10: note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10: note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478] | ^
[11:04:07.478] ../src/backend/utils/adt/encode.c:433:10: error: call to 'svwhilelt_b8' is ambiguous
[11:04:07.478] 433 | pred = svwhilelt_b8(i / 2, len / 2);
[11:04:07.478] | ^~~~~~~~~~~~
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10: note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10: note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10: note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478] | ^
[11:04:07.478] /Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10: note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478] | ^
[11:04:07.478] 2 errors generated.

---
Chiranmoy

Attachments:

v4-0001-SVE-support-for-hex-encode-and-hex-decode.patch (application/octet-stream)
From 015afdfa5b1eccc039bc1c276dd7a51d3729257a Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Tue, 4 Feb 2025 11:26:41 +0530
Subject: [PATCH v3] SVE support for hex encode and hex decode

---
 config/c-compiler.m4           |  58 +++++++++
 configure                      |  79 ++++++++++++
 configure.ac                   |   9 ++
 meson.build                    |  56 +++++++++
 src/backend/utils/adt/encode.c | 212 ++++++++++++++++++++++++++++++++-
 src/include/pg_config.h.in     |   3 +
 src/include/utils/builtins.h   |  55 ++++++++-
 7 files changed, 466 insertions(+), 6 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 8534cc54c1..d99ecfb2a7 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -704,3 +704,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_AVX512_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_HEX_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the ARM SVE intrinsic required for hex coding:
+# svtbl, svlsr_x, svand_z, svcreate2, etc.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_hex_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_HEX_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_hex_intrinsics])])dnl
+AC_CACHE_CHECK([for svtbl, svlsr_x, svand_z, svcreate2, etc], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int vec_len = svcntb();
+      char input@<:@32@:>@;
+      char output@<:@32@:>@;
+      svbool_t pred = svptrue_b8(), cmp1, cmp2;
+      svuint8_t bytes, hextbl_vec;
+      svuint8x2_t	merged;
+
+      if (vec_len >= 16)
+      {
+        /* intrinsics used in hex_encode_sve */
+        hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+        bytes = svld1(pred, (uint8_t *) input);
+        bytes = svlsr_x(pred, bytes, 4);
+        bytes = svand_x(pred, bytes, 0xF);
+        merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+        svst2(pred, (uint8_t *) output, merged);
+
+        /* intrinsics used in hex_decode_sve */
+        bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+        bytes = svsub_x(pred, bytes, 48);
+        cmp1 = svcmplt(pred, bytes, 16);
+        cmp2 = svcmpgt(pred, bytes, 9);
+        if (svptest_any(pred, svnot_z(pred, svorr_z(pred, cmp1, cmp2))))
+          return 0;
+        bytes = svsel(svand_z(pred, cmp1, cmp2), bytes, bytes);
+        bytes = svlsl_x(pred, bytes, svcntp_b8(pred, pred));
+        svst1(pred, output, bytes);
+
+        /* return computed value, to prevent the above being optimized away */
+        return output@<:@0@:>@ == 0;
+      }
+
+      return 0;
+    }],
+  [return hex_coding_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_HEX_INTRINSICS
diff --git a/configure b/configure
index ceeef9b091..e445cb1451 100755
--- a/configure
+++ b/configure
@@ -17168,6 +17168,85 @@ $as_echo "#define USE_AVX512_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svtbl, svlsr_z, svand_z, svcreate2, etc" >&5
+$as_echo_n "checking for svtbl, svlsr_z, svand_z, svcreate2, etc... " >&6; }
+if ${pgac_cv_arm_sve_hex_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int vec_len = svcntb();
+      char input[32];
+      char output[32];
+      svbool_t pred = svptrue_b8(), cmp1, cmp2;
+      svuint8_t bytes, hextbl_vec;
+      svuint8x2_t	merged;
+
+      if (vec_len >= 16)
+      {
+        /* intrinsics used in hex_encode_sve */
+        hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+        bytes = svld1(pred, (uint8_t *) input);
+        bytes = svlsr_x(pred, bytes, 4);
+        bytes = svand_x(pred, bytes, 0xF);
+        merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+        svst2(pred, (uint8_t *) output, merged);
+
+        /* intrinsics used in hex_decode_sve */
+        bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+        bytes = svsub_x(pred, bytes, 48);
+        cmp1 = svcmplt(pred, bytes, 16);
+        cmp2 = svcmpgt(pred, bytes, 9);
+        if (svptest_any(pred, svnot_z(pred, svorr_z(pred, cmp1, cmp2))))
+          return 0;
+        bytes = svsel(svand_z(pred, cmp1, cmp2), bytes, bytes);
+        bytes = svlsl_x(pred, bytes, svcntp_b8(pred, pred));
+        svst1(pred, output, bytes);
+
+        /* return computed value, to prevent the above being optimized away */
+        return output[0] == 0;
+      }
+
+      return 0;
+    }
+int
+main ()
+{
+return hex_coding_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_hex_intrinsics=yes
+else
+  pgac_cv_arm_sve_hex_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_hex_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_hex_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index d713360f34..2dbb678cae 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2021,6 +2021,15 @@ if test x"$host_cpu" = x"x86_64"; then
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS()
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM SVE intrinsic for hex coding.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index 8e128f4982..6a10331acf 100644
--- a/meson.build
+++ b/meson.build
@@ -2194,6 +2194,62 @@ int main(void)
 endif
 
 
+###############################################################
+# Check the availability of ARM SVE intrinsics for hex coding.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    int vec_len = svcntb();
+    char input[64] = {0};
+    char output[64] = {0};
+    svbool_t pred = svptrue_b8(), cmp1, cmp2;
+    svuint8_t bytes, hextbl_vec;
+    svuint8x2_t	merged;
+
+    if (vec_len >= 16)
+    {
+      /* intrinsics used in hex_encode_sve */
+      hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) "0123456789ABCDEF");
+      bytes = svld1(pred, (uint8_t *) input);
+      bytes = svlsr_x(pred, bytes, 4);
+      bytes = svand_x(pred, bytes, 0xF);
+      merged = svcreate2(svtbl(hextbl_vec, bytes), svtbl(hextbl_vec, bytes));
+      svst2(pred, (uint8_t *) output, merged);
+
+      /* intrinsics used in hex_decode_sve */
+      bytes = svget2(svld2(pred, (uint8_t *) output), 0);
+      bytes = svsub_x(pred, bytes, 48);
+      cmp1 = svcmplt(pred, bytes, 16);
+      cmp2 = svcmpgt(pred, bytes, 9);
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, cmp1, cmp2))))
+        return 0;
+      bytes = svsel(svand_z(pred, cmp1, cmp2), bytes, bytes);
+      bytes = svlsl_x(pred, bytes, svcntp_b8(pred, pred));
+      svst1(pred, output, bytes);
+
+      /* return computed value, to prevent the above being optimized away */
+      return output[0] == 0;
+    }
+
+    return 0;
+}
+'''
+
+  if cc.links(prog, name: 'ARM SVE hex coding', args: test_c_args)
+    cdata.set('USE_SVE_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d..cf0137a1f1 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -20,6 +20,12 @@
 #include "utils/memutils.h"
 #include "varatt.h"
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+#endif
 
 /*
  * Encoding conversion API.
@@ -177,8 +183,81 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+static uint64 hex_encode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_sve(const char *src, size_t len, char *dst,
+								  Node *escontext);
+static uint64 hex_encode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_choose(const char *src, size_t len, char *dst,
+									 Node *escontext);
+uint64 (*hex_encode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_encode_choose;
+uint64 (*hex_decode_optimized)
+	   (const char *src, size_t len, char *dst) = hex_decode_choose;
+uint64 (*hex_decode_safe_optimized)
+	   (const char *src, size_t len, char *dst, Node *escontext) =
+		hex_decode_safe_choose;
+
+/*
+ * Returns true if the CPU supports SVE instructions.
+ */
+static inline bool
+check_sve_support(void)
+{
+#if defined(HAVE_ELF_AUX_INFO) && defined(__aarch64__)  /* FreeBSD */
+	unsigned long value;
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL) && defined(__aarch64__)   /* Linux */
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_hex_functions(void)
+{
+	if (check_sve_support())
+	{
+		hex_encode_optimized = hex_encode_sve;
+		hex_decode_optimized = hex_decode_sve;
+		hex_decode_safe_optimized = hex_decode_safe_sve;
+	}
+	else
+	{
+		hex_encode_optimized = hex_encode_scalar;
+		hex_decode_optimized = hex_decode_scalar;
+		hex_decode_safe_optimized = hex_decode_safe_scalar;
+	}
+}
+
+static uint64
+hex_encode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_encode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_decode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext)
+{
+	choose_hex_functions();
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
 uint64
-hex_encode(const char *src, size_t len, char *dst)
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -208,13 +287,13 @@ get_hex(const char *cp, char *out)
 }
 
 uint64
-hex_decode(const char *src, size_t len, char *dst)
+hex_decode_scalar(const char *src, size_t len, char *dst)
 {
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
 uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +333,133 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+/*
+ * SVE implementation of hex_encode and hex_decode.
+ */
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_encode_sve(const char *src, size_t len, char *dst)
+{
+	const char	hextbl[] = "0123456789abcdef";
+	svbool_t	pred;
+	svuint8_t	bytes,
+				high,
+				low,
+				hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8 *) hextbl);
+	svuint8x2_t	merged;
+	uint32 		vec_len = svcntb();
+
+	for (size_t i = 0; i < len; i += vec_len)
+	{
+		pred = svwhilelt_b8((uint64) i, (uint64) len);
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_x(pred, bytes, 4);	/* shift-right to get the high nibble */
+		low = svand_z(pred, bytes, 0xF);   /* mask high to get the low nibble */
+
+		/*
+		 * Convert the nibbles to hex digits by indexing into hextbl_vec,
+		 * for example, a nibble value of 10 indexed into hextbl_vec gives 'a'.
+		 * Finally, interleave the high and low nibbles.
+		 */
+		merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, merged);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	return (uint64) len * 2;
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static inline bool
+get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	/*
+	 * Convert ASCII values '0'-'9' to integers 0-9 by subtracting 48.
+	 * Similarly, convert letters 'A'-'F' and 'a'-'f' to integers 10-15.
+	 */
+	svuint8_t	dgt_vec = svsub_x(pred, vec, 48),
+				cap_vec = svsub_x(pred, vec, 55),
+				sml_vec = svsub_x(pred, vec, 87),
+				ltr_vec;
+	/*
+	 * Identify valid integers in dgt_vec, cap_vec, and sml_vec.
+	 * Integers 0-9 are valid in dgt_vec, while integers 10-15 are valid
+	 * in cap_vec and sml_vec.
+	 */
+	svbool_t	valid_dgt = svcmplt(pred, dgt_vec, 10),
+				valid_ltr;
+
+	/* Combine cap_vec and sml_vec and mark the valid range 10-15. */
+	ltr_vec = svsel(svcmplt(pred, cap_vec, 16), cap_vec, sml_vec);
+	valid_ltr = svand_z(pred, svcmpgt(pred, ltr_vec, 9),
+							  svcmplt(pred, ltr_vec, 16));
+	/*
+	 * Check for invalid hexadecimal digits. Each value must fall
+	 * within the range 0-9 (true in valid_dgt) or 10-15 (true in valid_ltr).
+	 */
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_dgt, valid_ltr))))
+		return false;
+
+	/* Finally, combine dgt_vec and ltr_vec */
+	*res = svsel(valid_dgt, dgt_vec, ltr_vec);
+	return true;
+}
+
+uint64
+hex_decode_sve(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_sve(src, len, dst, NULL);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext)
+{
+	svbool_t	pred;
+	svuint8x2_t	bytes;
+	svuint8_t	high,
+				low;
+	uint32		processed;
+	size_t		i = 0,
+				loop_bytes = len & ~1;	/* handles inputs of odd length */
+	const char *p = dst;
+
+	while (i < loop_bytes)
+	{
+		pred = svwhilelt_b8((uint64) i / 2, (uint64) len / 2);
+		bytes = svld2(pred, (uint8 *) src);
+		high = svget2(bytes, 0);	/* hex digits for high nibble */
+		low = svget2(bytes, 1);		/* hex digits for low nibble */
+
+		/* fallback if a character below ASCII '0' is found. */
+		if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'),
+											svcmplt(pred, low, '0'))))
+			break;
+
+		/* fallback if invalid hexadecimal digit is found */
+		if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+			break;
+
+		/* left-shift high and perform bitwise OR with low to form the byte */
+		svst1(pred, (uint8 *) dst, svorr_z(pred, svlsl_x(pred, high, 4), low));
+
+		processed = svcntp_b8(pred, pred) * 2;
+		src += processed;
+		i += processed;
+		dst += processed / 2;
+	}
+
+	if (i < len)	/* fall back */
+		return dst - p + hex_decode_safe_scalar(src, len - i, dst, escontext);
+
+	return dst - p;
+}
+#endif							/* USE_SVE_WITH_RUNTIME_CHECK */
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 07b2f798ab..b5096c11f4 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -648,6 +648,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE instructions for hex coding with a runtime check. */
+#undef USE_SVE_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1c98c7d225..e9b1f963dd 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -35,11 +35,60 @@ extern int	errdatatype(Oid datatypeOid);
 extern int	errdomainconstraint(Oid datatypeOid, const char *conname);
 
 /* encode.c */
-extern uint64 hex_encode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode_safe(const char *src, size_t len, char *dst,
+extern uint64 hex_encode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_safe_scalar(const char *src, size_t len, char *dst,
 							  Node *escontext);
 
+/*
+ * We can use SVE intrinsics for hex-coding, but only if we can
+ * verify that the CPU supports it via a runtime check.
+ */
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*hex_encode_optimized)
+	   (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_optimized)
+	   (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_safe_optimized)
+	   (const char *src, size_t len, char *dst, Node *escontext);
+#endif		/* USE_SVE_WITH_RUNTIME_CHECK */
+
+static inline uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+	int	threshold = 16;
+
+	if (len >= threshold)
+		return hex_encode_optimized(src, len, dst);
+#endif
+	return hex_encode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_optimized(src, len, dst);
+#endif
+	return hex_decode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_SVE_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_safe_optimized(src, len, dst, escontext);
+#endif
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+}
+
 /* int.c */
 extern int2vector *buildint2vector(const int16 *int2s, int n);
 
-- 
2.34.1

#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#24)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

I have marked the commitfest entry for this [0] as waiting-on-author
because the patch needs to be rebased.

[0]: https://commitfest.postgresql.org/patch/5538/

--
nathan

#26Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#25)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Here's the rebased patch with a few modifications.

The hand-unrolled hex encode performs better than the non-unrolled version on
r8g.4xlarge, but shows no improvement on m7g.4xlarge.
Line-by-line comments explaining the changes, with an example, are also added.

Below are the results. Input size is in bytes, and exec time is in ms.

encode - r8g.4xlarge

 Input |   master |     SVE | SVE-unrolled
-------+----------+---------+--------------
     8 |    4.971 |   6.434 |        6.623
    16 |    8.532 |   4.399 |        4.710
    24 |   12.296 |   5.007 |        5.780
    32 |   16.003 |   5.027 |        5.234
    40 |   19.628 |   5.807 |        6.201
    48 |   23.277 |   5.815 |        6.222
    56 |   26.927 |   6.744 |        7.030
    64 |   30.419 |   6.774 |        6.347
   128 |   83.250 |  10.214 |        9.104
   256 |  112.158 |  17.892 |       16.313
   512 |  216.544 |  31.060 |       29.876
  1024 |  429.351 |  59.310 |       53.374
  2048 |  854.677 | 116.769 |      101.004
  4096 | 1706.528 | 237.322 |      195.297
  8192 | 3723.884 | 499.520 |      385.424
---------------------------------------

encode - m7g.4xlarge

 Input |   master |     SVE | SVE-unrolled
-------+----------+---------+--------------
     8 |    5.503 |   7.986 |        8.053
    16 |    9.881 |   9.583 |        9.888
    24 |   13.854 |   9.212 |       10.138
    32 |   18.056 |   9.208 |        9.364
    40 |   22.127 |  10.134 |       10.540
    48 |   26.214 |  10.186 |       10.550
    56 |   29.718 |  10.197 |       10.428
    64 |   33.613 |  10.982 |       10.497
   128 |   66.060 |  12.460 |       12.624
   256 |  130.225 |  18.491 |       18.872
   512 |  267.105 |  30.343 |       31.661
  1024 |  515.603 |  54.371 |       55.341
  2048 | 1013.766 | 103.898 |      105.192
  4096 | 2018.705 | 202.653 |      203.142
  8192 | 4000.496 | 400.918 |      401.842
---------------------------------------

decode - r8g.4xlarge

 Input |   master |      SVE
-------+----------+----------
     8 |    7.641 |    8.787
    16 |   14.301 |   14.477
    32 |   28.663 |    6.091
    48 |   42.940 |   17.604
    64 |   57.483 |   10.549
    80 |   71.637 |   19.194
    96 |   85.918 |   15.586
   112 |  100.272 |   25.956
   128 |  114.740 |   19.829
   256 |  229.176 |   36.032
   512 |  458.295 |   68.222
  1024 |  916.741 |  132.927
  2048 | 1833.422 |  262.741
  4096 | 3667.096 |  522.009
  8192 | 7333.886 | 1042.447
---------------------------------------

decode - m7g.4xlarge

 Input |   master |      SVE
-------+----------+----------
     8 |    8.194 |    9.433
    16 |   14.397 |   15.606
    32 |   26.669 |   29.006
    48 |   45.971 |   48.984
    64 |   58.468 |   12.388
    80 |   70.820 |   22.295
    96 |   84.792 |   43.470
   112 |   98.992 |   54.282
   128 |  113.250 |   25.508
   256 |  218.743 |   45.165
   512 |  414.133 |   86.800
  1024 |  828.493 |  174.670
  2048 | 1617.921 |  346.375
  4096 | 3259.159 |  689.391
  8192 | 6551.879 | 1376.195

--------
Chiranmoy

Attachments:

v5-0001-SVE-support-for-hex-coding.patch (application/octet-stream)
From 3a508684171ae411e4e8251c717b61a8def04c1f Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Mon, 9 Jun 2025 14:16:26 +0530
Subject: [PATCH v5] SVE support for hex coding

---
 config/c-compiler.m4                   |  85 ++++++++
 configure                              | 104 +++++++++
 configure.ac                           |   9 +
 meson.build                            |  81 +++++++
 src/backend/utils/adt/Makefile         |   1 +
 src/backend/utils/adt/encode.c         |   6 +-
 src/backend/utils/adt/encode_aarch64.c | 278 +++++++++++++++++++++++++
 src/backend/utils/adt/meson.build      |   1 +
 src/include/pg_config.h.in             |   3 +
 src/include/utils/builtins.h           |  51 ++++-
 10 files changed, 613 insertions(+), 6 deletions(-)
 create mode 100644 src/backend/utils/adt/encode_aarch64.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 5f3e1d1faf9..20e71cd8546 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -797,3 +797,88 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SVE_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_HEX_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the SVE intrinsic required for hex coding:
+# svsub_x, svcmplt, svsel, svcmpgt, svtbl, svlsr_x, svand_z, svcreate2,
+# svptest_any, svnot_z, svorr_z, svcntb, svld1, svwhilelt_b8, svst2, svld2,
+# svget2, svst1 and svlsl_x.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_hex_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_HEX_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_hex_intrinsics])])dnl
+AC_CACHE_CHECK([for svtbl, svlsr_x, svand_z, svcreate2, etc], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+
+    char input@<:@64@:>@;
+    char output@<:@128@:>@;
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+    {
+      svuint8_t	digit = svsub_x(pred, vec, 48),
+                upper = svsub_x(pred, vec, 55),
+                lower = svsub_x(pred, vec, 87);
+      svbool_t	valid_digit = svcmplt(pred, digit, 10),
+                valid_upper = svcmplt(pred, upper, 16);
+      svuint8_t	letter = svsel(valid_upper, upper, lower);
+      svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+                                            svcmplt(pred, letter, 16));
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+        return 0;
+      *res = svsel(valid_digit, digit, letter);
+      return 1;
+    }
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+      const char	*hextbl = "0123456789abcdef";
+      svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+      char *src = input, *dst = output;
+
+      /* hex encode */
+      for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+      {
+        svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+        svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                  high = svlsr_x(pred, bytes, 4),
+                  low = svand_z(pred, bytes, 0xF);
+        svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+        svst2(pred, (uint8_t *) dst, merged);
+      }
+
+      /* hex decode */
+      len = 128;
+
+      for (int i = 0; i < len; i += vec_len_x2)
+      {
+        svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+        svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+        svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+        if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+          break;
+        if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+          break;
+
+        svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+      }
+
+      /* return computed value, to prevent the above being optimized away */
+      return output@<:@0@:>@;
+    }],
+  [return hex_coding_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_HEX_INTRINSICS
diff --git a/configure b/configure
index 4f15347cc95..4d5d6acefb5 100755
--- a/configure
+++ b/configure
@@ -17851,6 +17851,110 @@ $as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svtbl, svlsr_x, svand_z, svcreate2, etc" >&5
+$as_echo_n "checking for svtbl, svlsr_x, svand_z, svcreate2, etc... " >&6; }
+if ${pgac_cv_arm_sve_hex_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+    char input[64];
+    char output[128];
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+    {
+      svuint8_t	digit = svsub_x(pred, vec, 48),
+                upper = svsub_x(pred, vec, 55),
+                lower = svsub_x(pred, vec, 87);
+      svbool_t	valid_digit = svcmplt(pred, digit, 10),
+                valid_upper = svcmplt(pred, upper, 16);
+      svuint8_t	letter = svsel(valid_upper, upper, lower);
+      svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+                                            svcmplt(pred, letter, 16));
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+        return 0;
+      *res = svsel(valid_digit, digit, letter);
+      return 1;
+    }
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+      const char	*hextbl = "0123456789abcdef";
+      svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+      char *src = input, *dst = output;
+
+      /* hex encode */
+      for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+      {
+        svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+        svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                  high = svlsr_x(pred, bytes, 4),
+                  low = svand_z(pred, bytes, 0xF);
+        svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+        svst2(pred, (uint8_t *) dst, merged);
+      }
+
+      /* hex decode */
+      len = 128;
+
+      for (int i = 0; i < len; i += vec_len_x2)
+      {
+        svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+        svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+        svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+        if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+          break;
+        if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+          break;
+
+        svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+      }
+
+      /* return computed value, to prevent the above being optimized away */
+      return output[0];
+    }
+int
+main ()
+{
+return hex_coding_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_hex_intrinsics=yes
+else
+  pgac_cv_arm_sve_hex_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_hex_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_hex_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_HEX_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index 4b8335dc613..fcae9b84616 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2107,6 +2107,15 @@ if test x"$host_cpu" = x"aarch64"; then
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS()
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_HEX_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM SVE intrinsic for hex coding.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index d142e3e408b..de2d1ebd384 100644
--- a/meson.build
+++ b/meson.build
@@ -2384,6 +2384,87 @@ int main(void)
 endif
 
 
+###############################################################
+# Check the availability of SVE intrinsics for hex coding.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+char input[64];
+char output[128];
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	svuint8_t	digit = svsub_x(pred, vec, 48),
+				upper = svsub_x(pred, vec, 55),
+				lower = svsub_x(pred, vec, 87);
+	svbool_t	valid_digit = svcmplt(pred, digit, 10),
+				valid_upper = svcmplt(pred, upper, 16);
+	svuint8_t	letter = svsel(valid_upper, upper, lower);
+	svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+										   svcmplt(pred, letter, 16));
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+		return 0;
+	*res = svsel(valid_digit, digit, letter);
+	return 1;
+}
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+    const char	hextbl[] = "0123456789abcdef";
+    svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+    char *src = input, *dst = output;
+
+    /* hex encode */
+    for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+    {
+      svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+      svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                high = svlsr_x(pred, bytes, 4),
+                low = svand_z(pred, bytes, 0xF);
+      svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+      svst2(pred, (uint8_t *) dst, merged);
+    }
+
+    /* hex decode */
+    len = 128;
+
+    for (int i = 0; i < len; i += vec_len_x2)
+    {
+      svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+      svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+      svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+      if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+        break;
+      if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+        break;
+
+      svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+    }
+
+    /* return computed value, to prevent the above being optimized away */
+    return output[0];
+}
+'''
+
+  if cc.links(prog, name: 'SVE hex coding', args: test_c_args)
+    cdata.set('USE_SVE_HEX_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 4a233b63c32..2a3ba1d4485 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -32,6 +32,7 @@ OBJS = \
 	dbsize.o \
 	domains.o \
 	encode.o \
+	encode_aarch64.o \
 	enum.o \
 	expandeddatum.o \
 	expandedrecord.o \
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d1..fa62ce3107d 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -178,7 +178,7 @@ static const int8 hexlookup[128] = {
 };
 
 uint64
-hex_encode(const char *src, size_t len, char *dst)
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -208,13 +208,13 @@ get_hex(const char *cp, char *out)
 }
 
 uint64
-hex_decode(const char *src, size_t len, char *dst)
+hex_decode_scalar(const char *src, size_t len, char *dst)
 {
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
 uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
diff --git a/src/backend/utils/adt/encode_aarch64.c b/src/backend/utils/adt/encode_aarch64.c
new file mode 100644
index 00000000000..574a7550469
--- /dev/null
+++ b/src/backend/utils/adt/encode_aarch64.c
@@ -0,0 +1,278 @@
+/*-------------------------------------------------------------------------
+ *
+ * encode_aarch64.c
+ *	  Holds the SVE hex encode/decode implementations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/encode_aarch64.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <c.h>
+
+#include "utils/builtins.h"
+
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+
+/*
+ * These are the SVE implementations of the hex encode/decode functions.
+ */
+static uint64 hex_encode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext);
+
+/*
+ * The function pointers are initially set to "choose" functions.  These
+ * functions will first set the pointers to the right implementations (based on
+ * what the current CPU supports) and then will call the pointer to fulfill the
+ * caller's request.
+ */
+
+static uint64 hex_encode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext);
+uint64 		(*hex_encode_optimized) (const char *src, size_t len, char *dst) = hex_encode_choose;
+uint64 		(*hex_decode_optimized) (const char *src, size_t len, char *dst) = hex_decode_choose;
+uint64 		(*hex_decode_safe_optimized) (const char *src, size_t len, char *dst, Node *escontext) = hex_decode_safe_choose;
+
+static inline bool
+check_sve_support(void)
+{
+#ifdef HAVE_ELF_AUX_INFO
+	unsigned long value;
+
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL)
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_hex_functions(void)
+{
+	if (check_sve_support())
+	{
+		hex_encode_optimized = hex_encode_sve;
+		hex_decode_optimized = hex_decode_sve;
+		hex_decode_safe_optimized = hex_decode_safe_sve;
+	}
+	else
+	{
+		hex_encode_optimized = hex_encode_scalar;
+		hex_decode_optimized = hex_decode_scalar;
+		hex_decode_safe_optimized = hex_decode_safe_scalar;
+	}
+}
+
+static uint64
+hex_encode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_encode_optimized(src, len, dst);
+}
+static uint64
+hex_decode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_decode_optimized(src, len, dst);
+}
+static uint64
+hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext)
+{
+	choose_hex_functions();
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+hex_encode_sve(const char *src, size_t len, char *dst)
+{
+	const char	hextbl[] = "0123456789abcdef";
+	uint32 		vec_len = svcntb();
+	svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8 *) hextbl);
+	svbool_t	pred = svptrue_b8();
+	size_t		loop_bytes = len & ~(2 * vec_len - 1); /* process 2 * vec_len bytes per iteration */
+	svuint8_t	bytes, high, low;
+	svuint8x2_t	zipped;
+
+	for (size_t i = 0; i < loop_bytes; i += 2 * vec_len)
+	{
+		bytes = svld1(pred, (uint8 *) src);
+
+		/* Right-shift to obtain the high nibble */
+		high = svlsr_x(pred, bytes, 4);
+
+		/* Mask the high nibble to obtain the low nibble */
+		low = svand_z(pred, bytes, 0xF);
+
+		/*
+		 * Convert the high and low nibbles to hexadecimal digits using a
+		 * vectorized table lookup and zip (interleave) the hexadecimal digits.
+		 */
+		zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+
+		/* unrolled */
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_x(pred, bytes, 4);
+		low = svand_z(pred, bytes, 0xF);
+
+		zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	/* process remaining tail bytes */
+	for (size_t i = loop_bytes; i < len; i += vec_len)
+	{
+		pred = svwhilelt_b8((uint64) i, (uint64) len);
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_x(pred, bytes, 4);
+		low = svand_z(pred, bytes, 0xF);
+
+		zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	return (uint64) len * 2;
+}
+
+/*
+ * get_hex_sve
+ *      Returns true if the hexadecimal digits are successfully converted
+ *      to nibbles and stored in 'res'; otherwise, returns false.
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static inline bool
+get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	/*
+	 * Convert ASCII of '0'-'9' to integers 0-9 by subtracting 48 (ASCII of '0').
+	 * Similarly, convert letters 'A'–'F' and 'a'–'f' to integers 10–15 by
+	 * subtracting 55 ('A' - 10) and 87 ('a' - 10).
+	 */
+	svuint8_t	digit = svsub_x(pred, vec, '0'),
+				upper = svsub_x(pred, vec, 'A' - 10),
+				lower = svsub_x(pred, vec, 'a' - 10);
+
+	/*
+	 * Identify valid values in digits, upper, and lower vectors.
+	 * Values 0-9 are valid in digits, while values 10-15 are valid
+	 * in upper and lower.
+	 *
+	 * Example:
+	 * 		vec: 				'0'  '9'  'A'  'F'  'a'  'f'
+	 * 		vec (in ASCII):		48   57   65   70   97   102
+	 *
+	 * 		digit:	 			0    9    17   22   49   54
+	 * 		valid_digit:		1	 1	   0	0	 0	  0
+	 *
+	 * 		upper:				249  2    10   15   42   47
+	 * 		valid_upper:		0	 1	   1	1	 0	  0
+	 *
+	 * 		lower:				217  226  234  239  10   15
+	 *
+	 * Note that values 0-9 are also marked valid in valid_upper; this will be
+	 * handled later.
+	 */
+	svbool_t	valid_digit = svcmplt(pred, digit, 10),
+				valid_upper = svcmplt(pred, upper, 16);
+
+	/*
+	 * Merge upper and lower vector using the logic: take the element from
+	 * upper if it's true in valid_upper else pick the element in lower
+	 *
+	 * Mark the valid range i.e. 10-15 in letter vector
+	 *
+	 * 		letter:				217  2    10   15   10   15
+	 * 		valid_letter:		0	 0	   1	1    1	  1
+	 */
+
+	svuint8_t	letter = svsel(valid_upper, upper, lower);
+	svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+											 svcmplt(pred, letter, 16));
+
+	/*
+	 * Check for invalid hexadecimal digit. Each value must fall within
+	 * the range 0-9 (true in valid_digit) or 10-15 (true in valid_letter) i.e.
+	 * the OR of valid_digit and valid_letter should be all true.
+	 */
+
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+		return false;
+
+	/*
+	 * Finally, combine digit and letter vectors using the logic:
+	 * take the element from digit if it's true in valid_digit else pick the
+	 * element in letter.
+	 *
+	 * 		res:	 			0    9    10   15   10   15
+	 */
+
+	*res = svsel(valid_digit, digit, letter);
+	return true;
+}
+
+static uint64
+hex_decode_sve(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_sve(src, len, dst, NULL);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext)
+{
+	uint32		vec_len = svcntb();
+	size_t		loop_bytes = len & ~(2 * vec_len - 1); /* process 2 * vec_len bytes per iteration */
+	svbool_t 	pred = svptrue_b8();
+	const char *p = dst;
+
+	for (size_t i = 0; i < loop_bytes; i += 2 * vec_len)
+	{
+		svuint8x2_t bytes = svld2(pred, (uint8 *) src);
+		svuint8_t 	high = svget2(bytes, 0),
+				  	low = svget2(bytes, 1);
+
+		/* fallback for characters with ASCII values below '0' */
+		if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+			break;
+
+		/* fallback if an invalid hexadecimal digit is found */
+		if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+			break;
+
+		/* form the byte by left-shifting the high nibble and OR-ing it with the low nibble */
+		svst1(pred, (uint8 *) dst, svorr_z(pred, svlsl_x(pred, high, 4), low));
+
+		src += 2 * vec_len;
+		dst += vec_len;
+	}
+
+	if (len > loop_bytes) /* fallback */
+		return dst - p + hex_decode_safe_scalar(src, len - loop_bytes, dst, escontext);
+
+	return dst - p;
+}
+
+#endif	/* USE_SVE_HEX_WITH_RUNTIME_CHECK */
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index 244f48f4fd7..ea88dd77390 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -21,6 +21,7 @@ backend_sources += files(
   'dbsize.c',
   'domains.c',
   'encode.c',
+  'encode_aarch64.c',
   'enum.c',
   'expandeddatum.c',
   'expandedrecord.c',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 726a7c1be1f..7a227f1875f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -675,6 +675,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE instructions for hex coding with a runtime check. */
+#undef USE_SVE_HEX_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1c98c7d2255..2f72d8df9d1 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -35,11 +35,56 @@ extern int	errdatatype(Oid datatypeOid);
 extern int	errdomainconstraint(Oid datatypeOid, const char *conname);
 
 /* encode.c */
-extern uint64 hex_encode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode_safe(const char *src, size_t len, char *dst,
+extern uint64 hex_encode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_safe_scalar(const char *src, size_t len, char *dst,
 							  Node *escontext);
 
+/*
+ * On AArch64, we can try to use an SVE-optimized hex encode/decode on some systems.
+ */
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*hex_encode_optimized) (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_optimized) (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_safe_optimized) (const char *src, size_t len, char *dst, Node *escontext);
+#endif
+
+static inline uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+	int	threshold = 16;
+
+	if (len >= threshold)
+		return hex_encode_optimized(src, len, dst);
+#endif
+	return hex_encode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_optimized(src, len, dst);
+#endif
+	return hex_decode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_safe_optimized(src, len, dst, escontext);
+#endif
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+}
+
 /* int.c */
 extern int2vector *buildint2vector(const int16 *int2s, int n);
 
-- 
2.34.1

#27Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#26)
3 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Attaching the rebased patch, some regression tests for SIMD hex-coding,
and a script to test bytea performance (usage info in the script).

The results obtained using the script on an m7g.4xlarge are shown below.

Read Operation
table (MB) | HEAD (ms) | SVE (ms) | improvement (%)
---------------------------------------------------
        52 |       136 |      111 |           18.38
       105 |       215 |      164 |           23.72
       209 |       452 |      331 |           26.76
       419 |       830 |      602 |           27.46

Write Operation (table size measured after the write)
table (MB) | HEAD (ms) | SVE (ms) | improvement (%)
---------------------------------------------------
        52 |      1430 |     1361 |            4.82
       105 |      2956 |     2816 |            4.73

The bytea write numbers are averaged over 7 runs, with the table
truncated and vacuumed after each run.
--------
Chiranmoy

Attachments:

bytea_test.pytext/x-python; name=bytea_test.pyDownload
v1-0001-hex-coding-regress-test.patchapplication/octet-stream; name=v1-0001-hex-coding-regress-test.patchDownload
From 9066e3296160af4e703f90f460f8e75471b6425d Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Sun, 6 Jul 2025 19:25:28 +0530
Subject: [PATCH v1] hex coding regress test

---
 src/test/regress/expected/hex_coding.out | 63 ++++++++++++++++++++++++
 src/test/regress/parallel_schedule       |  5 ++
 src/test/regress/sql/hex_coding.sql      | 39 +++++++++++++++
 3 files changed, 107 insertions(+)
 create mode 100644 src/test/regress/expected/hex_coding.out
 create mode 100644 src/test/regress/sql/hex_coding.sql

diff --git a/src/test/regress/expected/hex_coding.out b/src/test/regress/expected/hex_coding.out
new file mode 100644
index 00000000000..e6d78fa4876
--- /dev/null
+++ b/src/test/regress/expected/hex_coding.out
@@ -0,0 +1,63 @@
+--
+-- tests for hex_encode and hex_decode in encode.c
+--
+-- Build table for testing
+CREATE TABLE BYTEA_TABLE(data BYTEA);
+-- hex_decode is used for inserting into bytea column
+-- Set bytea_output to hex so that hex_encode is used and tested
+SET bytea_output = 'hex';
+INSERT INTO BYTEA_TABLE VALUES ('\xAB');
+INSERT INTO BYTEA_TABLE VALUES ('\x01ab');
+INSERT INTO BYTEA_TABLE VALUES ('\xDEADC0DE');
+INSERT INTO BYTEA_TABLE VALUES ('\xbaadf00d');
+INSERT INTO BYTEA_TABLE VALUES ('\x C001   c0ffee  '); -- hex string with whitespaces
+-- errors checking
+INSERT INTO BYTEA_TABLE VALUES ('\xbadf00d'); -- odd number of hex digits
+ERROR:  invalid hexadecimal data: odd number of digits
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xbadf00d');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xdeadcode'); -- invalid hexadecimal digit: "o"
+ERROR:  invalid hexadecimal digit: "o"
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xdeadcode');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC0FFEE'); -- invalid hexadecimal digit: "L"
+ERROR:  invalid hexadecimal digit: "L"
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xC00LC0FFEE');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC*DE'); -- invalid hexadecimal digit: "L" (reported before "*")
+ERROR:  invalid hexadecimal digit: "L"
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xC00LC*DE');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xbad f00d'); -- invalid hexadecimal digit: " "
+ERROR:  invalid hexadecimal digit: " "
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xbad f00d');
+                                        ^
+-- long hex strings to test SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || repeat('baadf00d', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || '   ' || repeat('baad f00d', 8))::bytea; -- hex string with whitespaces
+-- errors checking for SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'badf00d' || repeat('DEADC0DE', 4))::bytea; -- odd number of hex digits
+ERROR:  invalid hexadecimal data: odd number of digits
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'baadfood'|| repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "o"
+ERROR:  invalid hexadecimal digit: "o"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'C00LC0FFEE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "L"
+ERROR:  invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'C00LC*DE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "L" (reported before "*")
+ERROR:  invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'bad f00d' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: " "
+ERROR:  invalid hexadecimal digit: " "
+SELECT encode(data, 'hex') FROM BYTEA_TABLE;
+                                                              encode                                                              
+----------------------------------------------------------------------------------------------------------------------------------
+ ab
+ 01ab
+ deadc0de
+ baadf00d
+ c001c0ffee
+ deadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0de
+ deadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0debaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00d
+ deadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0debaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00d
+(8 rows)
+
+DROP TABLE BYTEA_TABLE;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index a424be2a6bf..8812d80d592 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -109,6 +109,11 @@ test: select_views portals_p2 foreign_key cluster dependency guc bitmapops combo
 # ----------
 test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath sqljson sqljson_queryfuncs sqljson_jsontable
 
+# ----------
+# Another group of parallel tests for hex encode/decode
+# ----------
+test: hex_coding
+
 # ----------
 # Another group of parallel tests
 # with depends on create_misc
diff --git a/src/test/regress/sql/hex_coding.sql b/src/test/regress/sql/hex_coding.sql
new file mode 100644
index 00000000000..97c51b62e90
--- /dev/null
+++ b/src/test/regress/sql/hex_coding.sql
@@ -0,0 +1,39 @@
+--
+-- tests for hex_encode and hex_decode in encode.c
+--
+
+-- Build table for testing
+CREATE TABLE BYTEA_TABLE(data BYTEA);
+
+-- hex_decode is used for inserting into bytea column
+-- Set bytea_output to hex so that hex_encode is used and tested
+SET bytea_output = 'hex';
+
+INSERT INTO BYTEA_TABLE VALUES ('\xAB');
+INSERT INTO BYTEA_TABLE VALUES ('\x01ab');
+INSERT INTO BYTEA_TABLE VALUES ('\xDEADC0DE');
+INSERT INTO BYTEA_TABLE VALUES ('\xbaadf00d');
+INSERT INTO BYTEA_TABLE VALUES ('\x C001   c0ffee  '); -- hex string with whitespaces
+
+-- errors checking
+INSERT INTO BYTEA_TABLE VALUES ('\xbadf00d'); -- odd number of hex digits
+INSERT INTO BYTEA_TABLE VALUES ('\xdeadcode'); -- invalid hexadecimal digit: "o"
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC0FFEE'); -- invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC*DE'); -- invalid hexadecimal digit: "L" (reported before "*")
+INSERT INTO BYTEA_TABLE VALUES ('\xbad f00d'); -- invalid hexadecimal digit: " "
+
+-- long hex strings to test SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || repeat('baadf00d', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || '   ' || repeat('baad f00d', 8))::bytea; -- hex string with whitespaces
+
+-- errors checking for SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'badf00d' || repeat('DEADC0DE', 4))::bytea; -- odd number of hex digits
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'baadfood'|| repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "o"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'C00LC0FFEE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'C00LC*DE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "L" (reported before "*")
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'bad f00d' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: " "
+
+SELECT encode(data, 'hex') FROM BYTEA_TABLE;
+
+DROP TABLE BYTEA_TABLE;
-- 
2.34.1

v5-0001-SVE-support-for-hex-coding.patchapplication/octet-stream; name=v5-0001-SVE-support-for-hex-coding.patchDownload
From 5a9bc0e99f7ae102c11cc905cd6c4df6016c415d Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Sun, 6 Jul 2025 19:35:46 +0530
Subject: [PATCH v5] SVE support for hex coding

---
 config/c-compiler.m4                   |  85 ++++++++
 configure                              | 104 +++++++++
 configure.ac                           |   9 +
 meson.build                            |  81 +++++++
 src/backend/utils/adt/Makefile         |   1 +
 src/backend/utils/adt/encode.c         |   6 +-
 src/backend/utils/adt/encode_aarch64.c | 280 +++++++++++++++++++++++++
 src/backend/utils/adt/meson.build      |   1 +
 src/include/pg_config.h.in             |   3 +
 src/include/utils/builtins.h           |  51 ++++-
 10 files changed, 615 insertions(+), 6 deletions(-)
 create mode 100644 src/backend/utils/adt/encode_aarch64.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index da40bd6a647..73d12826698 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -798,3 +798,88 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SVE_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_HEX_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the SVE intrinsic required for hex coding:
+# svsub_x, svcmplt, svsel, svcmpgt, svtbl, svlsr_x, svand_z, svcreate2,
+# svptest_any, svnot_z, svorr_z, svcntb, svld1, svwhilelt_b8, svst2, svld2,
+# svget2, svst1 and svlsl_x.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_hex_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_HEX_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_hex_intrinsics])])dnl
+AC_CACHE_CHECK([for svtbl, svlsr_x, svand_z, svcreate2, etc], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+
+    char input@<:@64@:>@;
+    char output@<:@128@:>@;
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+    {
+      svuint8_t	digit = svsub_x(pred, vec, 48),
+                upper = svsub_x(pred, vec, 55),
+                lower = svsub_x(pred, vec, 87);
+      svbool_t	valid_digit = svcmplt(pred, digit, 10),
+                valid_upper = svcmplt(pred, upper, 16);
+      svuint8_t	letter = svsel(valid_upper, upper, lower);
+      svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+                                            svcmplt(pred, letter, 16));
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+        return 0;
+      *res = svsel(valid_digit, digit, letter);
+      return 1;
+    }
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+      const char	*hextbl = "0123456789abcdef";
+      svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+      char *src = input, *dst = output;
+
+      /* hex encode */
+      for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+      {
+        svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+        svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                  high = svlsr_x(pred, bytes, 4),
+                  low = svand_z(pred, bytes, 0xF);
+        svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+        svst2(pred, (uint8_t *) dst, merged);
+      }
+
+      /* hex decode */
+      len = 128;
+
+      for (int i = 0; i < len; i += vec_len_x2)
+      {
+        svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+        svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+        svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+        if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+          break;
+        if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+          break;
+
+        svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+      }
+
+      /* return computed value, to prevent the above being optimized away */
+      return output@<:@0@:>@;
+    }],
+  [return hex_coding_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_HEX_INTRINSICS
diff --git a/configure b/configure
index 16ef5b58d1a..df78a5408d3 100755
--- a/configure
+++ b/configure
@@ -17851,6 +17851,110 @@ $as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svtbl, svlsr_x, svand_z, svcreate2, etc" >&5
+$as_echo_n "checking for svtbl, svlsr_x, svand_z, svcreate2, etc... " >&6; }
+if ${pgac_cv_arm_sve_hex_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+    char input[64];
+    char output[128];
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+    {
+      svuint8_t	digit = svsub_x(pred, vec, 48),
+                upper = svsub_x(pred, vec, 55),
+                lower = svsub_x(pred, vec, 87);
+      svbool_t	valid_digit = svcmplt(pred, digit, 10),
+                valid_upper = svcmplt(pred, upper, 16);
+      svuint8_t	letter = svsel(valid_upper, upper, lower);
+      svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+                                            svcmplt(pred, letter, 16));
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+        return 0;
+      *res = svsel(valid_digit, digit, letter);
+      return 1;
+    }
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+      const char	*hextbl = "0123456789abcdef";
+      svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+      char *src = input, *dst = output;
+
+      /* hex encode */
+      for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+      {
+        svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+        svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                  high = svlsr_x(pred, bytes, 4),
+                  low = svand_z(pred, bytes, 0xF);
+        svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+        svst2(pred, (uint8_t *) dst, merged);
+      }
+
+      /* hex decode */
+      len = 128;
+
+      for (int i = 0; i < len; i += vec_len_x2)
+      {
+        svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+        svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+        svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+        if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+          break;
+        if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+          break;
+
+        svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+      }
+
+      /* return computed value, to prevent the above being optimized away */
+      return output[0];
+    }
+int
+main ()
+{
+return hex_coding_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_hex_intrinsics=yes
+else
+  pgac_cv_arm_sve_hex_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_hex_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_hex_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_HEX_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index b3efc49c97a..ce0015bb543 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2107,6 +2107,15 @@ if test x"$host_cpu" = x"aarch64"; then
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS()
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_HEX_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM SVE intrinsic for hex coding.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index a97854a947d..68700a4bba0 100644
--- a/meson.build
+++ b/meson.build
@@ -2389,6 +2389,87 @@ int main(void)
 endif
 
 
+###############################################################
+# Check the availability of SVE intrinsics for hex coding.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+char input[64];
+char output[128];
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	svuint8_t	digit = svsub_x(pred, vec, 48),
+				    upper = svsub_x(pred, vec, 55),
+				    lower = svsub_x(pred, vec, 87);
+	svbool_t	valid_digit = svcmplt(pred, digit, 10),
+            valid_upper = svcmplt(pred, upper, 16);
+	svuint8_t	letter = svsel(valid_upper, upper, lower);
+	svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+							  				                 svcmplt(pred, letter, 16));
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+		return 0;
+	*res = svsel(valid_digit, digit, letter);
+	return 1;
+}
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+    const char	hextbl[] = "0123456789abcdef";
+    svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+    char *src = input, *dst = output;
+
+    /* hex encode */
+    for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+    {
+      svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+      svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                high = svlsr_x(pred, bytes, 4),
+                low = svand_z(pred, bytes, 0xF);
+      svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+      svst2(pred, (uint8_t *) dst, merged);
+    }
+
+    /* hex decode */
+    len = 128;
+
+    for (int i = 0; i < len; i += vec_len_x2)
+    {
+      svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+      svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+      svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+      if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+        break;
+      if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+        break;
+
+      svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+    }
+
+    /* return computed value, to prevent the above being optimized away */
+    return output[0];
+}
+'''
+
+  if cc.links(prog, name: 'SVE hex coding', args: test_c_args)
+    cdata.set('USE_SVE_HEX_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index ffeacf2b819..d2fa03efe98 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -33,6 +33,7 @@ OBJS = \
 	dbsize.o \
 	domains.o \
 	encode.o \
+	encode_aarch64.o \
 	enum.o \
 	expandeddatum.o \
 	expandedrecord.o \
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d1..fa62ce3107d 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -178,7 +178,7 @@ static const int8 hexlookup[128] = {
 };
 
 uint64
-hex_encode(const char *src, size_t len, char *dst)
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -208,13 +208,13 @@ get_hex(const char *cp, char *out)
 }
 
 uint64
-hex_decode(const char *src, size_t len, char *dst)
+hex_decode_scalar(const char *src, size_t len, char *dst)
 {
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
 uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
diff --git a/src/backend/utils/adt/encode_aarch64.c b/src/backend/utils/adt/encode_aarch64.c
new file mode 100644
index 00000000000..bf8157900f8
--- /dev/null
+++ b/src/backend/utils/adt/encode_aarch64.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * encode_aarch64.c
+ *	  Holds the SVE hex encode/decode implementations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/encode_aarch64.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <c.h>
+
+#include "utils/builtins.h"
+
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+
+/*
+ * These are the SVE implementations of the hex encode/decode functions.
+ */
+static uint64 hex_encode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext);
+
+/*
+ * The function pointers are initially set to "choose" functions.  These
+ * functions will first set the pointers to the right implementations (based on
+ * what the current CPU supports) and then will call the pointer to fulfill the
+ * caller's request.
+ */
+
+static uint64 hex_encode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext);
+uint64 		(*hex_encode_optimized) (const char *src, size_t len, char *dst) = hex_encode_choose;
+uint64 		(*hex_decode_optimized) (const char *src, size_t len, char *dst) = hex_decode_choose;
+uint64 		(*hex_decode_safe_optimized) (const char *src, size_t len, char *dst, Node *escontext) = hex_decode_safe_choose;
+
+static inline bool
+check_sve_support(void)
+{
+#ifdef HAVE_ELF_AUX_INFO
+	unsigned long value;
+
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL)
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_hex_functions(void)
+{
+	if (check_sve_support())
+	{
+		hex_encode_optimized = hex_encode_sve;
+		hex_decode_optimized = hex_decode_sve;
+		hex_decode_safe_optimized = hex_decode_safe_sve;
+	}
+	else
+	{
+		hex_encode_optimized = hex_encode_scalar;
+		hex_decode_optimized = hex_decode_scalar;
+		hex_decode_safe_optimized = hex_decode_safe_scalar;
+	}
+}
+
+static uint64
+hex_encode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_encode_optimized(src, len, dst);
+}
+static uint64
+hex_decode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_decode_optimized(src, len, dst);
+}
+static uint64
+hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext)
+{
+	choose_hex_functions();
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_encode_sve(const char *src, size_t len, char *dst)
+{
+	const char	hextbl[] = "0123456789abcdef";
+	uint32 		vec_len = svcntb();
+	svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8 *) hextbl);
+	svbool_t	pred = svptrue_b8();
+	size_t		loop_bytes = len & ~(2 * vec_len - 1); /* process 2 * vec_len byte chunk each iteration */
+	svuint8_t	bytes, high, low;
+	svuint8x2_t	zipped;
+
+	for (size_t i = 0; i < loop_bytes; i += 2 * vec_len)
+	{
+		bytes = svld1(pred, (uint8 *) src);
+		
+		/* Right-shift to obtain the high nibble */
+		high = svlsr_x(pred, bytes, 4);
+
+		/* Mask the high nibble to obtain the low nibble */
+		low = svand_z(pred, bytes, 0xF);
+
+		/*
+		 * Convert the high and low nibbles to hexadecimal digits using a
+		 * vectorized table lookup and zip (interleave) the hexadecimal digits.
+		 */
+		zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+
+		/* unrolled */
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_x(pred, bytes, 4);
+		low = svand_z(pred, bytes, 0xF);
+
+		zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	/* process remaining tail bytes */
+	for (size_t i = loop_bytes; i < len; i += vec_len)
+	{
+		pred = svwhilelt_b8((uint64) i, (uint64) len);
+		bytes = svld1(pred, (uint8 *) src);
+		high = svlsr_x(pred, bytes, 4);
+		low = svand_z(pred, bytes, 0xF);
+
+		zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	return (uint64) len * 2;
+}
+
+/*
+ * get_hex_sve
+ *      Returns true if the hexadecimal digits are successfully converted
+ *      to nibbles and stored in 'res'; otherwise, returns false.
+ */
+pg_attribute_target("arch=armv8-a+sve")
+static inline bool
+get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	/*
+	 * Convert ASCII of '0'-'9' to integers 0-9 by subtracting 48 (ASCII of '0').
+	 * Similarly, convert letters 'A'–'F' and 'a'–'f' to integers 10–15 by
+	 * subtracting 55 ('A' - 10) and 87 ('a' - 10).
+	 */
+	svuint8_t	digit = svsub_x(pred, vec, '0'),
+				upper = svsub_x(pred, vec, 'A' - 10),
+				lower = svsub_x(pred, vec, 'a' - 10);
+
+	/*
+	 * Identify valid values in digits, upper, and lower vectors.
+	 * Values 0-9 are valid in digits, while values 10-15 are valid
+	 * in upper and lower.
+	 *
+	 * Example:
+	 * 		vec: 				'0'  '9'  'A'  'F'  'a'  'f'
+	 * 		vec (in ASCII):		48   57   65   70   97   102
+	 *
+	 * 		digit:	 			0    9    17   22   49   54
+	 * 		valid_digit:		1	 1	   0	0	 0	  0
+	 *
+	 * 		upper:				249  2    10   15   42   47
+	 * 		valid_upper:		0	 1	   1	1	 0	  0
+	 *
+	 * 		lower:				217  226  234  239  10   15
+	 *
+	 * Note that values 0-9 are also marked valid in valid_upper; this will be
+	 * handled later.
+	 */
+	svbool_t	valid_digit = svcmplt(pred, digit, 10),
+				valid_upper = svcmplt(pred, upper, 16);
+
+	/*
+	 * Merge upper and lower vector using the logic: take the element from
+	 * upper if it's true in valid_upper else pick the element in lower
+	 *
+	 * Mark the valid range i.e. 10-15 in letter vector
+	 *
+	 * 		letter:				217  2    10   15   10   15
+	 * 		valid_letter:		0	 0	   1	1    1	  1
+	 */
+
+	svuint8_t	letter = svsel(valid_upper, upper, lower);
+	svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+											 svcmplt(pred, letter, 16));
+
+	/*
+	 * Check for invalid hexadecimal digit. Each value must fall within
+	 * the range 0-9 (true in valid_digit) or 10-15 (true in valid_letter) i.e.
+	 * the OR of valid_digit and valid_letter should be all true.
+	 */
+
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+		return false;
+
+	/*
+	 * Finally, combine digit and letter vectors using the logic:
+	 * take the element from digit if it's true in valid_digit else pick the
+	 * element in letter.
+	 * 
+	 * 		res:	 			0    9    10   15   10   15
+	 */
+
+	*res = svsel(valid_digit, digit, letter);
+	return true;
+}
+
+uint64
+hex_decode_sve(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_sve(src, len, dst, NULL);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+uint64
+hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext)
+{
+	uint32		vec_len = svcntb();
+	size_t		i = 0,
+				loop_bytes = len & ~(2 * vec_len - 1); /* process 2 * vec_len byte chunk each iteration */
+	svbool_t 	pred = svptrue_b8();
+	const char *p = dst;
+
+	while (i < loop_bytes)
+	{
+		svuint8x2_t bytes = svld2(pred, (uint8 *) src);
+		svuint8_t 	high = svget2(bytes, 0),
+				  	low = svget2(bytes, 1);
+
+		/* fallback for characters with ASCII values below '0' */
+		if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+			break;
+
+		/* fallback if an invalid hexadecimal digit is found */
+		if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+			break;
+
+		/* form the byte by left-shifting the high nibble and OR-ing it with the low nibble */
+		svst1(pred, (uint8 *) dst, svorr_z(pred, svlsl_x(pred, high, 4), low));
+
+		i += 2 * vec_len;
+		src += 2 * vec_len;
+		dst += vec_len;
+	}
+
+	if (len > i) /* fallback */
+		return dst - p + hex_decode_safe_scalar(src, len - i, dst, escontext);
+
+	return dst - p;
+}
+
+#endif	/* USE_SVE_HEX_WITH_RUNTIME_CHECK */
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index ed9bbd7b926..094a9c7c013 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -22,6 +22,7 @@ backend_sources += files(
   'dbsize.c',
   'domains.c',
   'encode.c',
+  'encode_aarch64.c',
   'enum.c',
   'expandeddatum.c',
   'expandedrecord.c',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 726a7c1be1f..7a227f1875f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -675,6 +675,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE instructions for hex coding with a runtime check. */
+#undef USE_SVE_HEX_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1c98c7d2255..2f72d8df9d1 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -35,11 +35,56 @@ extern int	errdatatype(Oid datatypeOid);
 extern int	errdomainconstraint(Oid datatypeOid, const char *conname);
 
 /* encode.c */
-extern uint64 hex_encode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode_safe(const char *src, size_t len, char *dst,
+extern uint64 hex_encode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_safe_scalar(const char *src, size_t len, char *dst,
 							  Node *escontext);
 
+/*
+ * On AArch64, we can try to use an SVE-optimized hex encode/decode on some systems.
+ */
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*hex_encode_optimized) (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_optimized) (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_safe_optimized) (const char *src, size_t len, char *dst, Node *escontext);
+#endif
+
+static inline uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+	int	threshold = 16;
+
+	if (len >= threshold)
+		return hex_encode_optimized(src, len, dst);
+#endif
+	return hex_encode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_optimized(src, len, dst);
+#endif
+	return hex_decode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_safe_optimized(src, len, dst, escontext);
+#endif
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+}
+
 /* int.c */
 extern int2vector *buildint2vector(const int16 *int2s, int n);
 
-- 
2.34.1

#28Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#27)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Hi all,

Since the CommitFest is underway, could we get some feedback to improve the patch?

_______
Chiranmoy

#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#28)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Sep 03, 2025 at 11:11:24AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Since the CommitFest is underway, could we get some feedback to improve
the patch?

I see that there was some discussion about a Neon implementation upthread,
but I'm not sure we concluded anything. For popcount, we first added a
Neon version before adding the SVE version, which required more complicated
configure/runtime checks. Presumably Neon is available on more hardware
than SVE, so that could be a good place to start here, too.

Also, I'd strongly encourage you to get involved with others' patches on
the mailing lists (e.g., reviewing, testing). Patch submissions are great,
but this community depends on other types of participation, too. IME
helping others with their patches also tends to incentivize others to help
with yours.

--
nathan

#30John Naylor
johncnaylorls@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#28)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Sep 3, 2025 at 6:11 PM Chiranmoy.Bhattacharya@fujitsu.com
<Chiranmoy.Bhattacharya@fujitsu.com> wrote:

Hi all,

Since the CommitFest is underway, could we get some feedback to improve the patch?

On that note, I was hoping you could give us feedback on whether the
improvement in PG18 made any difference at all in your real-world
use-case, i.e. not just in a microbenchmark, but also including
transmission of the hex-encoded values across the network to the
client (that I assume must decode them again).

--
John Naylor
Amazon Web Services

#31Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: John Naylor (#30)
8 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

I see that there was some discussion about a Neon implementation upthread,
but I'm not sure we concluded anything. For popcount, we first added a
Neon version before adding the SVE version, which required more complicated
configure/runtime checks. Presumably Neon is available on more hardware
than SVE, so that could be a good place to start here, too.

We have added the Neon versions of hex encode/decode.
Here are the microbenchmark numbers.

hex_encode - m7g.4xlarge
 Input |    Head |   Neon
-------+---------+--------
    32 |  18.056 |  5.957
    40 |  22.127 | 10.205
    48 |  26.214 | 14.151
    64 |  33.613 |  6.164
   128 |  66.060 | 11.372
   256 | 130.225 | 18.543
   512 | 267.105 | 33.977
  1024 | 515.603 | 64.462

hex_decode - m7g.4xlarge
 Input |    Head |    Neon
-------+---------+---------
    32 |  26.669 |   9.462
    40 |  36.320 |  19.347
    48 |  45.971 |  19.099
    64 |  58.468 |  17.648
   128 | 113.250 |  30.437
   256 | 218.743 |  56.824
   512 | 414.133 | 107.212
  1024 | 828.493 | 210.740

Also, I'd strongly encourage you to get involved with others' patches on
the mailing lists (e.g., reviewing, testing). Patch submissions are great,
but this community depends on other types of participation, too. IME
helping others with their patches also tends to incentivize others to help
with yours.

Sure, we will try to test/review patches on areas we have experience.

On that note, I was hoping you could give us feedback on whether the
improvement in PG18 made any difference at all in your real-world
use-case, i.e. not just in a microbenchmark, but also including
transmission of the hex-encoded values across the network to the
client (that I assume must decode them again).

Yes, the improvement in v18 did help; see the attached perf graphs.
We used a Python script to send and receive binary data from PostgreSQL.
For simple SELECT queries on a bytea column, hex_encode accounted for
42% of query execution time in v17; this dropped to 33% in v18,
an overall query-time improvement of around 18%.

The proposed patch further reduces hex_encode's share to 5.6%,
yielding another 25% improvement in total query time.

We observed similar improvements for INSERT queries on the bytea column:
hex_decode usage decreased from 15.5% to 5.5%, a 5-8% query-level
improvement depending on the storage type used.

------
Chiranmoy

Attachments:

v6-0001-NEON-support-for-hex-coding.patch (application/octet-stream)
From e642b6d32d4715c988b6b93d57385a7c0779182d Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Thu, 4 Sep 2025 12:23:24 +0530
Subject: [PATCH v6 1/3] NEON support for hex coding

---
 src/backend/utils/adt/Makefile         |   1 +
 src/backend/utils/adt/encode.c         |   6 +-
 src/backend/utils/adt/encode_aarch64.c | 195 +++++++++++++++++++++++++
 src/backend/utils/adt/meson.build      |   1 +
 src/include/utils/builtins.h           |  57 +++++++-
 5 files changed, 254 insertions(+), 6 deletions(-)
 create mode 100644 src/backend/utils/adt/encode_aarch64.c

diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index cc68ac545a5..40eaee14899 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -33,6 +33,7 @@ OBJS = \
 	dbsize.o \
 	domains.o \
 	encode.o \
+	encode_aarch64.o \
 	enum.o \
 	expandeddatum.o \
 	expandedrecord.o \
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d1..fa62ce3107d 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -178,7 +178,7 @@ static const int8 hexlookup[128] = {
 };
 
 uint64
-hex_encode(const char *src, size_t len, char *dst)
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -208,13 +208,13 @@ get_hex(const char *cp, char *out)
 }
 
 uint64
-hex_decode(const char *src, size_t len, char *dst)
+hex_decode_scalar(const char *src, size_t len, char *dst)
 {
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
 uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
diff --git a/src/backend/utils/adt/encode_aarch64.c b/src/backend/utils/adt/encode_aarch64.c
new file mode 100644
index 00000000000..7b0412dc255
--- /dev/null
+++ b/src/backend/utils/adt/encode_aarch64.c
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * encode_aarch64.c
+ *	  Holds the AArch64 hex encode/decode implementations.
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/encode_aarch64.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <c.h>
+
+#include "utils/builtins.h"
+
+#ifdef HEX_CODING_AARCH64
+
+#include <arm_neon.h>
+
+uint64
+hex_encode_optimized(const char *src, size_t len, char *dst)
+{
+	const char		hextbl[] = "0123456789abcdef";
+	uint8x16_t		hextbl_vec = vld1q_u8((uint8 *) hextbl);
+	uint8x16x2_t	zipped;
+	uint32			vec_len = sizeof(uint8x16_t);
+	size_t			chunks_len = len & ~(2 * vec_len - 1);
+
+	for (size_t i = 0; i < chunks_len; i += 2 * vec_len)
+	{
+		uint8x16_t bytes = vld1q_u8((uint8 *) src);
+
+		/* Right-shift by 4 to get the high nibble */
+		uint8x16_t high = vshrq_n_u8(bytes, 4);
+
+		/* Mask the high nibble to get the low nibble */
+		uint8x16_t low = vandq_u8(bytes, vdupq_n_u8(0xF));
+
+		/*
+		 * Convert the high and low nibbles to hexadecimal digits using a table
+		 * lookup, and then zip (interleave) the resulting digits.
+		 */
+		zipped.val[0] = vqtbl1q_u8(hextbl_vec, high);
+		zipped.val[1] = vqtbl1q_u8(hextbl_vec, low);
+		vst2q_u8((uint8 *) dst, zipped);
+
+		src += vec_len;
+		dst += 2 * vec_len;
+
+		/* unrolled */
+		bytes = vld1q_u8((uint8 *) src);
+		high = vshrq_n_u8(bytes, 4);
+		low = vandq_u8(bytes, vdupq_n_u8(0xF));
+
+		zipped.val[0] = vqtbl1q_u8(hextbl_vec, high);
+		zipped.val[1] = vqtbl1q_u8(hextbl_vec, low);
+		vst2q_u8((uint8 *) dst, zipped);
+
+		src += vec_len;
+		dst += 2 * vec_len;
+	}
+
+
+	if (len > chunks_len)
+		hex_encode_scalar(src, len - chunks_len, dst);
+
+	return (uint64) len * 2;
+}
+
+/*
+ * get_hex_neon
+ *      Returns true if the hexadecimal digits are successfully converted
+ *      to nibbles and stored in 'res'; otherwise, returns false.
+ */
+static inline bool
+get_hex_neon(uint8x16_t vec, uint8x16_t *res)
+{
+	/*
+	 * Convert ASCII of '0'-'9' to integers 0-9 by subtracting 48 (ASCII of '0').
+	 * Similarly, convert letters 'A'–'F' and 'a'–'f' to integers 10–15 by
+	 * subtracting 55 ('A' - 10) and 87 ('a' - 10).
+	 */
+	uint8x16_t digit = vsubq_u8(vec, vdupq_n_u8('0'));
+	uint8x16_t upper = vsubq_u8(vec, vdupq_n_u8('A' - 10));
+	uint8x16_t lower = vsubq_u8(vec, vdupq_n_u8('a' - 10));
+
+	/*
+	 * Identify valid values in digit, upper, and lower vectors.
+	 * Values 0-9 are valid in digits, while values 10-15 are valid
+	 * in upper and lower.
+	 *
+	 * Example:
+	 * 		vec: 				'0'  '9'  'A'  'F'  'a'  'f'
+	 * 		vec (in ASCII):		48   57   65   70   97   102
+	 *
+	 * 		digit:	 			0    9    17   22   49   54
+	 * 		valid_digit:		1	 1	   0	0	 0	  0
+	 *
+	 * 		upper:				249  2    10   15   42   47
+	 * 		valid_upper:		0	 1	   1	1	 0	  0
+	 *
+	 * 		lower:				217  226  234  239  10   15
+	 *
+	 * Note that values 0-9 are also marked valid in valid_upper; this will be
+	 * handled later.
+	 */
+
+	uint8x16_t	valid_digit = vcltq_u8(digit, vdupq_n_u8(10));
+	uint8x16_t	valid_upper = vcltq_u8(upper, vdupq_n_u8(16));
+
+	/*
+	 * Merge the upper and lower vectors: for each lane, pick the element
+	 * from upper where valid_upper is true, otherwise the one from lower.
+	 *
+	 * Then mark the valid range (10-15) in the letter vector:
+	 *
+	 * 		letter:				217  2    10   15   10   15
+	 * 		valid_letter:		0	 0	   1	1    1	  1
+	 */
+
+	uint8x16_t 	letter = vbslq_u8(valid_upper, upper, lower);
+	uint8x16_t	valid_letter = vandq_u8(vcgtq_u8(letter, vdupq_n_u8(9)),
+										vcltq_u8(letter, vdupq_n_u8(16)));
+
+	/*
+	 * Check for invalid hexadecimal digit. Each value must fall within
+	 * the range 0-9 (true in valid_digit) or 10-15 (true in valid_letter) i.e.
+	 * the OR of valid_digit and valid_letter should be all true.
+	 */
+	uint8x16_t invalid_mask = vmvnq_u8(vorrq_u8(valid_digit, valid_letter));
+
+	if (vmaxvq_u8(invalid_mask) != 0)
+		return false;
+
+	/*
+	 * Finally, combine the digit and letter vectors: for each lane, pick the
+	 * element from digit where valid_digit is true, otherwise the element
+	 * from letter.
+	 *
+	 *  	res:	 			0    9    10   15   10   15
+	 */
+	*res = vbslq_u8(valid_digit, digit, letter);
+	return true;
+}
+
+uint64
+hex_decode_optimized(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_optimized(src, len, dst, NULL);
+}
+
+uint64
+hex_decode_safe_optimized(const char *src, size_t len, char *dst, Node *escontext)
+{
+	uint32		vec_len = sizeof(uint8x16_t);
+	size_t		i = 0;
+	size_t 		chunks_len = len & ~(2 * vec_len - 1); /* process 2 x vec_len per iteration */
+	uint8x16_t	ascii_zero = vdupq_n_u8('0');
+	const char *p = dst;
+
+	while (i < chunks_len)
+	{
+		uint8x16x2_t bytes = vld2q_u8((uint8 *) src);
+		uint8x16_t high = bytes.val[0];	/* hex digits for high nibble */
+		uint8x16_t low = bytes.val[1];	/* hex digits for low nibble */
+
+		/* fallback for characters with ASCII values below '0' */
+		uint8x16_t is_below_zero = vorrq_u8(vcltq_u8(high, ascii_zero),
+											vcltq_u8(low, ascii_zero));
+		if (vmaxvq_u8(is_below_zero) != 0)
+			break;
+
+		/* fallback if an invalid hexadecimal digit is found */
+		if (!get_hex_neon(high, &high) || !get_hex_neon(low, &low))
+			break;
+
+		/* form the byte by left-shifting the high nibble and OR-ing it with the low nibble */
+		vst1q_u8((uint8 *) dst, vorrq_u8(vshlq_n_u8(high, 4), low));
+
+		i += 2 * vec_len;
+		src += 2 * vec_len;
+		dst += vec_len;
+	}
+
+	if (len > i) /* tail bytes or invalid digit: fall back to scalar */
+		return dst - p + hex_decode_safe_scalar(src, len - i, dst, escontext);
+
+	return dst - p;
+}
+
+#endif /* HEX_CODING_AARCH64 */
diff --git a/src/backend/utils/adt/meson.build b/src/backend/utils/adt/meson.build
index dac372c3bea..8b106d03d33 100644
--- a/src/backend/utils/adt/meson.build
+++ b/src/backend/utils/adt/meson.build
@@ -22,6 +22,7 @@ backend_sources += files(
   'dbsize.c',
   'domains.c',
   'encode.c',
+  'encode_aarch64.c',
   'enum.c',
   'expandeddatum.c',
   'expandedrecord.c',
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1c98c7d2255..a809fad2771 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -35,11 +35,62 @@ extern int	errdatatype(Oid datatypeOid);
 extern int	errdomainconstraint(Oid datatypeOid, const char *conname);
 
 /* encode.c */
-extern uint64 hex_encode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode(const char *src, size_t len, char *dst);
-extern uint64 hex_decode_safe(const char *src, size_t len, char *dst,
+extern uint64 hex_encode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_scalar(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_safe_scalar(const char *src, size_t len, char *dst,
 							  Node *escontext);
 
+/*
+ * On AArch64, we can use Neon instructions if the compiler provides access to
+ * them (as indicated by __ARM_NEON). As in simd.h, we assume that all
+ * available 64-bit hardware has Neon support.
+ */
+#if defined(__aarch64__) && defined(__ARM_NEON)
+#define HEX_CODING_AARCH64 1
+#endif
+
+#ifdef HEX_CODING_AARCH64
+extern uint64 hex_encode_optimized(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_optimized(const char *src, size_t len, char *dst);
+extern uint64 hex_decode_safe_optimized(const char *src, size_t len, char *dst, Node *escontext);
+#endif
+
+static inline uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef HEX_CODING_AARCH64
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_encode_optimized(src, len, dst);
+#endif
+	return hex_encode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode(const char *src, size_t len, char *dst)
+{
+#ifdef HEX_CODING_AARCH64
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_optimized(src, len, dst);
+#endif
+	return hex_decode_scalar(src, len, dst);
+}
+
+static inline uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef HEX_CODING_AARCH64
+	int	threshold = 32;
+
+	if (len >= threshold)
+		return hex_decode_safe_optimized(src, len, dst, escontext);
+#endif
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+}
+
 /* int.c */
 extern int2vector *buildint2vector(const int16 *int2s, int n);
 
-- 
2.34.1

Attachment: v6-0002-SVE-support-for-hex-coding.patch (application/octet-stream)
From 7330548cc5b5ebdbe5c8fa515f1eb2eebfc7f2c3 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Thu, 4 Sep 2025 15:33:47 +0530
Subject: [PATCH v6 2/3] SVE support for hex coding

---
 config/c-compiler.m4                   |  85 +++++++++
 configure                              | 104 +++++++++++
 configure.ac                           |   9 +
 meson.build                            |  81 +++++++++
 src/backend/utils/adt/encode_aarch64.c | 231 ++++++++++++++++++++++++-
 src/include/pg_config.h.in             |   3 +
 src/include/utils/builtins.h           |  10 +-
 7 files changed, 520 insertions(+), 3 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index da40bd6a647..73d12826698 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -798,3 +798,88 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_SVE_POPCNT_INTRINSICS
+
+# PGAC_ARM_SVE_HEX_INTRINSICS
+# ------------------------------
+# Check if the compiler supports the SVE intrinsics required for hex coding:
+# svsub_x, svcmplt, svsel, svcmpgt, svtbl, svlsr_x, svand_z, svcreate2,
+# svptest_any, svnot_z, svorr_z, svcntb, svld1, svwhilelt_b8, svst2, svld2,
+# svget2, svst1 and svlsl_x.
+#
+# If the intrinsics are supported, sets pgac_arm_sve_hex_intrinsics.
+AC_DEFUN([PGAC_ARM_SVE_HEX_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_arm_sve_hex_intrinsics])])dnl
+AC_CACHE_CHECK([for svtbl, svlsr_x, svand_z, svcreate2, etc], [Ac_cachevar],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <arm_sve.h>
+
+    char input@<:@64@:>@;
+    char output@<:@128@:>@;
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+    {
+      svuint8_t	digit = svsub_x(pred, vec, 48),
+                upper = svsub_x(pred, vec, 55),
+                lower = svsub_x(pred, vec, 87);
+      svbool_t	valid_digit = svcmplt(pred, digit, 10),
+                valid_upper = svcmplt(pred, upper, 16);
+      svuint8_t	letter = svsel(valid_upper, upper, lower);
+      svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+                                            svcmplt(pred, letter, 16));
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+        return 0;
+      *res = svsel(valid_digit, digit, letter);
+      return 1;
+    }
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+      const char	*hextbl = "0123456789abcdef";
+      svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+      char *src = input, *dst = output;
+
+      /* hex encode */
+      for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+      {
+        svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+        svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                  high = svlsr_x(pred, bytes, 4),
+                  low = svand_z(pred, bytes, 0xF);
+        svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+        svst2(pred, (uint8_t *) dst, merged);
+      }
+
+      /* hex decode */
+      len = 128;
+
+      for (int i = 0; i < len; i += vec_len_x2)
+      {
+        svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+        svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+        svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+        if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+          break;
+        if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+          break;
+
+        svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+      }
+
+      /* return computed value, to prevent the above being optimized away */
+      return output@<:@0@:>@;
+    }],
+  [return hex_coding_test();])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])])
+if test x"$Ac_cachevar" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_ARM_SVE_HEX_INTRINSICS
diff --git a/configure b/configure
index 39c68161cec..60354107f87 100755
--- a/configure
+++ b/configure
@@ -17735,6 +17735,110 @@ $as_echo "#define USE_SVE_POPCNT_WITH_RUNTIME_CHECK 1" >>confdefs.h
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for svtbl, svlsr_x, svand_z, svcreate2, etc" >&5
+$as_echo_n "checking for svtbl, svlsr_x, svand_z, svcreate2, etc... " >&6; }
+if ${pgac_cv_arm_sve_hex_intrinsics+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#include <arm_sve.h>
+
+    char input[64];
+    char output[128];
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+    {
+      svuint8_t	digit = svsub_x(pred, vec, 48),
+                upper = svsub_x(pred, vec, 55),
+                lower = svsub_x(pred, vec, 87);
+      svbool_t	valid_digit = svcmplt(pred, digit, 10),
+                valid_upper = svcmplt(pred, upper, 16);
+      svuint8_t	letter = svsel(valid_upper, upper, lower);
+      svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+                                            svcmplt(pred, letter, 16));
+      if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+        return 0;
+      *res = svsel(valid_digit, digit, letter);
+      return 1;
+    }
+
+    #if defined(__has_attribute) && __has_attribute (target)
+        __attribute__((target("arch=armv8-a+sve")))
+    #endif
+    static int hex_coding_test(void)
+    {
+      int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+      const char	*hextbl = "0123456789abcdef";
+      svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+      char *src = input, *dst = output;
+
+      /* hex encode */
+      for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+      {
+        svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+        svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                  high = svlsr_x(pred, bytes, 4),
+                  low = svand_z(pred, bytes, 0xF);
+        svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+        svst2(pred, (uint8_t *) dst, merged);
+      }
+
+      /* hex decode */
+      len = 128;
+
+      for (int i = 0; i < len; i += vec_len_x2)
+      {
+        svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+        svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+        svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+        if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+          break;
+        if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+          break;
+
+        svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+      }
+
+      /* return computed value, to prevent the above being optimized away */
+      return output[0];
+    }
+int
+main ()
+{
+return hex_coding_test();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv_arm_sve_hex_intrinsics=yes
+else
+  pgac_cv_arm_sve_hex_intrinsics=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_arm_sve_hex_intrinsics" >&5
+$as_echo "$pgac_cv_arm_sve_hex_intrinsics" >&6; }
+if test x"$pgac_cv_arm_sve_hex_intrinsics" = x"yes"; then
+  pgac_arm_sve_hex_intrinsics=yes
+fi
+
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+
+$as_echo "#define USE_SVE_HEX_WITH_RUNTIME_CHECK 1" >>confdefs.h
+
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm_crc32_u8 and _mm_crc32_u32" >&5
diff --git a/configure.ac b/configure.ac
index 066e3976c0a..6ca57b8c4a7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2136,6 +2136,15 @@ if test x"$host_cpu" = x"aarch64"; then
   fi
 fi
 
+# Check for ARM SVE intrinsics for hex coding
+#
+if test x"$host_cpu" = x"aarch64"; then
+  PGAC_ARM_SVE_HEX_INTRINSICS()
+  if test x"$pgac_arm_sve_hex_intrinsics" = x"yes"; then
+    AC_DEFINE(USE_SVE_HEX_WITH_RUNTIME_CHECK, 1, [Define to 1 to use ARM SVE intrinsic for hex coding.])
+  fi
+fi
+
 # Check for Intel SSE 4.2 intrinsics to do CRC calculations.
 #
 PGAC_SSE42_CRC32_INTRINSICS()
diff --git a/meson.build b/meson.build
index ab8101d67b2..ea5392cfc78 100644
--- a/meson.build
+++ b/meson.build
@@ -2372,6 +2372,87 @@ int main(void)
 endif
 
 
+###############################################################
+# Check the availability of SVE intrinsics for hex coding.
+###############################################################
+
+if host_cpu == 'aarch64'
+
+  prog = '''
+#include <arm_sve.h>
+
+char input[64];
+char output[128];
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	svuint8_t	digit = svsub_x(pred, vec, 48),
+				    upper = svsub_x(pred, vec, 55),
+				    lower = svsub_x(pred, vec, 87);
+	svbool_t	valid_digit = svcmplt(pred, digit, 10),
+            valid_upper = svcmplt(pred, upper, 16);
+	svuint8_t	letter = svsel(valid_upper, upper, lower);
+	svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+											                 svcmplt(pred, letter, 16));
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+		return 0;
+	*res = svsel(valid_digit, digit, letter);
+	return 1;
+}
+
+#if defined(__has_attribute) && __has_attribute (target)
+    __attribute__((target("arch=armv8-a+sve")))
+#endif
+int main(void)
+{
+    int len = 64, vec_len = svcntb(), vec_len_x2 = svcntb() * 2;
+    const char	hextbl[] = "0123456789abcdef";
+    svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8_t *) hextbl);
+    char *src = input, *dst = output;
+
+    /* hex encode */
+    for (uint64_t i = 0; i < 64; i += vec_len, dst += 2 * vec_len, src += vec_len)
+    {
+      svbool_t  pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
+      svuint8_t bytes = svld1(pred, (uint8_t *) src),
+                high = svlsr_x(pred, bytes, 4),
+                low = svand_z(pred, bytes, 0xF);
+      svuint8x2_t merged = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+      svst2(pred, (uint8_t *) dst, merged);
+    }
+
+    /* hex decode */
+    len = 128;
+
+    for (int i = 0; i < len; i += vec_len_x2)
+    {
+      svbool_t 	  pred = svwhilelt_b8((uint64_t) i / 2, (uint64_t) len / 2);
+      svuint8x2_t bytes = svld2(pred, (uint8_t *) src + i);
+      svuint8_t 	high = svget2(bytes, 0), low = svget2(bytes, 1);
+
+      if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+        break;
+      if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+        break;
+
+      svst1(pred, (uint8_t *) dst + i / 2, svorr_z(pred, svlsl_x(pred, high, 4), low));
+    }
+
+    /* return computed value, to prevent the above being optimized away */
+    return output[0];
+}
+'''
+
+  if cc.links(prog, name: 'SVE hex coding', args: test_c_args)
+    cdata.set('USE_SVE_HEX_WITH_RUNTIME_CHECK', 1)
+  endif
+
+endif
+
+
 ###############################################################
 # Select CRC-32C implementation.
 #
diff --git a/src/backend/utils/adt/encode_aarch64.c b/src/backend/utils/adt/encode_aarch64.c
index 7b0412dc255..4f22fb1d6c7 100644
--- a/src/backend/utils/adt/encode_aarch64.c
+++ b/src/backend/utils/adt/encode_aarch64.c
@@ -20,8 +20,229 @@
 
 #include <arm_neon.h>
 
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+#include <arm_sve.h>
+
+#if defined(HAVE_ELF_AUX_INFO) || defined(HAVE_GETAUXVAL)
+#include <sys/auxv.h>
+#endif
+
+/*
+ * These are the NEON implementations of the hex encode/decode functions.
+ */
+static uint64 hex_encode_neon(const char *src, size_t len, char *dst);
+static uint64 hex_decode_neon(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_neon(const char *src, size_t len, char *dst, Node *escontext);
+
+/*
+ * These are the SVE implementations of the hex encode/decode functions.
+ */
+static uint64 hex_encode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_sve(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext);
+
+/*
+ * The function pointers are initially set to "choose" functions.  These
+ * functions will first set the pointers to the right implementations (based on
+ * what the current CPU supports) and then will call the pointer to fulfill the
+ * caller's request.
+ */
+
+static uint64 hex_encode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_choose(const char *src, size_t len, char *dst);
+static uint64 hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext);
+uint64 		(*hex_encode_optimized) (const char *src, size_t len, char *dst) = hex_encode_choose;
+uint64 		(*hex_decode_optimized) (const char *src, size_t len, char *dst) = hex_decode_choose;
+uint64 		(*hex_decode_safe_optimized) (const char *src, size_t len, char *dst, Node *escontext) = hex_decode_safe_choose;
+
+static inline bool
+check_sve_support(void)
+{
+#ifdef HAVE_ELF_AUX_INFO
+	unsigned long value;
+
+	return elf_aux_info(AT_HWCAP, &value, sizeof(value)) == 0 &&
+		(value & HWCAP_SVE) != 0;
+#elif defined(HAVE_GETAUXVAL)
+	return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
+#else
+	return false;
+#endif
+}
+
+static inline void
+choose_hex_functions(void)
+{
+	if (check_sve_support())
+	{
+		hex_encode_optimized = hex_encode_sve;
+		hex_decode_optimized = hex_decode_sve;
+		hex_decode_safe_optimized = hex_decode_safe_sve;
+	}
+	else
+	{
+		hex_encode_optimized = hex_encode_neon;
+		hex_decode_optimized = hex_decode_neon;
+		hex_decode_safe_optimized = hex_decode_safe_neon;
+	}
+}
+
+static uint64
+hex_encode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_encode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_choose(const char *src, size_t len, char *dst)
+{
+	choose_hex_functions();
+	return hex_decode_optimized(src, len, dst);
+}
+
+static uint64
+hex_decode_safe_choose(const char *src, size_t len, char *dst, Node *escontext)
+{
+	choose_hex_functions();
+	return hex_decode_safe_optimized(src, len, dst, escontext);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+hex_encode_sve(const char *src, size_t len, char *dst)
+{
+	const char	hextbl[] = "0123456789abcdef";
+	size_t 		vec_len = svcntb();
+	svuint8_t	hextbl_vec = svld1(svwhilelt_b8(0, 16), (uint8 *) hextbl);
+	svbool_t	pred_true = svptrue_b8();
+	size_t		chunks_len = len - len % (2 * vec_len); /* process 2 * vec_len bytes per iteration (VL need not be a power of two) */
+
+	for (size_t i = 0; i < chunks_len; i += 2 * vec_len)
+	{
+		svuint8_t	bytes = svld1(pred_true, (uint8 *) src);
+		svuint8_t	high = svlsr_x(pred_true, bytes, 4);
+		svuint8_t	low = svand_z(pred_true, bytes, 0xF);
+		svuint8x2_t	zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred_true, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+
+		/* unrolled */
+		bytes = svld1(pred_true, (uint8 *) src);
+		high = svlsr_x(pred_true, bytes, 4);
+		low = svand_z(pred_true, bytes, 0xF);
+		zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred_true, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	/* process remaining tail bytes */
+	for (size_t i = chunks_len; i < len; i += vec_len)
+	{
+		svbool_t	pred = svwhilelt_b8((uint64) i, (uint64) len);
+		svuint8_t	bytes = svld1(pred, (uint8 *) src);
+		svuint8_t	high = svlsr_x(pred, bytes, 4);
+		svuint8_t	low = svand_z(pred, bytes, 0xF);
+		svuint8x2_t	zipped = svcreate2(svtbl(hextbl_vec, high), svtbl(hextbl_vec, low));
+		svst2(pred, (uint8 *) dst, zipped);
+
+		dst += 2 * vec_len;
+		src += vec_len;
+	}
+
+	return (uint64) len * 2;
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static inline bool
+get_hex_sve(svbool_t pred, svuint8_t vec, svuint8_t *res)
+{
+	svuint8_t	digit = svsub_x(pred, vec, 48);
+	svuint8_t	upper = svsub_x(pred, vec, 55);
+	svuint8_t	lower = svsub_x(pred, vec, 87);
+
+	svbool_t	valid_digit = svcmplt(pred, digit, 10);
+	svbool_t	valid_upper = svcmplt(pred, upper, 16);
+
+	svuint8_t	letter = svsel(valid_upper, upper, lower);
+	svbool_t	valid_letter = svand_z(pred, svcmpgt(pred, letter, 9),
+											 svcmplt(pred, letter, 16));
+
+	if (svptest_any(pred, svnot_z(pred, svorr_z(pred, valid_digit, valid_letter))))
+		return false;
+
+	*res = svsel(valid_digit, digit, letter);
+	return true;
+}
+
+static uint64
+hex_decode_sve(const char *src, size_t len, char *dst)
+{
+	return hex_decode_safe_sve(src, len, dst, NULL);
+}
+
+pg_attribute_target("arch=armv8-a+sve")
+static uint64
+hex_decode_safe_sve(const char *src, size_t len, char *dst, Node *escontext)
+{
+	size_t		vec_len = svcntb();
+	size_t		i = 0;
+	size_t		chunks_len = len - len % (2 * vec_len); /* process a 2 * vec_len byte chunk each iteration (VL need not be a power of two) */
+	svbool_t 	pred = svptrue_b8();
+	const char *p = dst;
+	bool		fallback = false;
+
+	while (i < chunks_len)
+	{
+		svuint8x2_t bytes = svld2(pred, (uint8 *) src);
+		svuint8_t 	high = svget2(bytes, 0);
+		svuint8_t  	low = svget2(bytes, 1);
+
+		if (svptest_any(pred, svorr_z(pred, svcmplt(pred, high, '0'), svcmplt(pred, low, '0'))))
+		{
+			fallback = true;
+			break;
+		}
+
+		if (!get_hex_sve(pred, high, &high) || !get_hex_sve(pred, low, &low))
+		{
+			fallback = true;
+			break;
+		}
+
+		svst1(pred, (uint8 *) dst, svorr_z(pred, svlsl_x(pred, high, 4), low));
+
+		i += 2 * vec_len;
+		src += 2 * vec_len;
+		dst += vec_len;
+	}
+
+	if (len > i && !fallback) /* can use neon for smaller chunks */
+		return dst - p + hex_decode_safe_neon(src, len - i, dst, escontext);
+
+	if (fallback) /* invalid digit found: let the scalar code report it */
+		return dst - p + hex_decode_safe_scalar(src, len - i, dst, escontext);
+
+	return dst - p;
+}
+
+#endif	/* USE_SVE_HEX_WITH_RUNTIME_CHECK */
+
+/*
+ * If the compiler supports SVE, give the NEON versions their own names:
+ * the hex_*_optimized symbols are then function pointers, not functions.
+ */
+
 uint64
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+static hex_encode_neon(const char *src, size_t len, char *dst)
+#else
 hex_encode_optimized(const char *src, size_t len, char *dst)
+#endif
 {
 	const char		hextbl[] = "0123456789abcdef";
 	uint8x16_t		hextbl_vec = vld1q_u8((uint8 *) hextbl);
@@ -63,8 +284,6 @@ hex_encode_optimized(const char *src, size_t len, char *dst)
 		dst += 2 * vec_len;
 	}
 
-
-
 	if (len > chunks_len)
 		hex_encode_scalar(src, len - chunks_len, dst);
 
@@ -148,13 +367,21 @@ get_hex_neon(uint8x16_t vec, uint8x16_t *res)
 }
 
 uint64
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+static hex_decode_neon(const char *src, size_t len, char *dst)
+#else
 hex_decode_optimized(const char *src, size_t len, char *dst)
+#endif
 {
 	return hex_decode_safe_optimized(src, len, dst, NULL);
 }
 
 uint64
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+static hex_decode_safe_neon(const char *src, size_t len, char *dst, Node *escontext)
+#else
 hex_decode_safe_optimized(const char *src, size_t len, char *dst, Node *escontext)
+#endif
 {
 	uint32		vec_len = sizeof(uint8x16_t);
 	size_t		i = 0;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c4dc5d72bdb..a6735bdd21f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -678,6 +678,9 @@
 /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */
 #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
 
+/* Define to 1 to use SVE instructions for hex coding with a runtime check. */
+#undef USE_SVE_HEX_WITH_RUNTIME_CHECK
+
 /* Define to 1 to build with Bonjour support. (--with-bonjour) */
 #undef USE_BONJOUR
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index a809fad2771..8a80a9ae51f 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -49,7 +49,15 @@ extern uint64 hex_decode_safe_scalar(const char *src, size_t len, char *dst,
 #define HEX_CODING_AARCH64 1
 #endif
 
-#ifdef HEX_CODING_AARCH64
+/*
+ * On systems that may support SVE, the optimized hex encode/decode entry
+ * points are function pointers, resolved on first use via a runtime check.
+ */
+#ifdef USE_SVE_HEX_WITH_RUNTIME_CHECK
+extern PGDLLIMPORT uint64 (*hex_encode_optimized) (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_optimized) (const char *src, size_t len, char *dst);
+extern PGDLLIMPORT uint64 (*hex_decode_safe_optimized) (const char *src, size_t len, char *dst, Node *escontext);
+#elif HEX_CODING_AARCH64
 extern uint64 hex_encode_optimized(const char *src, size_t len, char *dst);
 extern uint64 hex_decode_optimized(const char *src, size_t len, char *dst);
 extern uint64 hex_decode_safe_optimized(const char *src, size_t len, char *dst, Node *escontext);
-- 
2.34.1

Attachment: v6-0003-Regression-tests-for-SIMD-hex-coding.patch (application/octet-stream)
From 2fdd7f1170253984c1b065ac3a0fc43a31997c05 Mon Sep 17 00:00:00 2001
From: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Date: Thu, 4 Sep 2025 15:44:19 +0530
Subject: [PATCH v6 3/3] Regression tests for SIMD hex coding

---
 src/test/regress/expected/hex_coding.out | 63 ++++++++++++++++++++++++
 src/test/regress/parallel_schedule       |  5 ++
 src/test/regress/sql/hex_coding.sql      | 39 +++++++++++++++
 3 files changed, 107 insertions(+)
 create mode 100644 src/test/regress/expected/hex_coding.out
 create mode 100644 src/test/regress/sql/hex_coding.sql

diff --git a/src/test/regress/expected/hex_coding.out b/src/test/regress/expected/hex_coding.out
new file mode 100644
index 00000000000..e6d78fa4876
--- /dev/null
+++ b/src/test/regress/expected/hex_coding.out
@@ -0,0 +1,63 @@
+--
+-- tests for hex_encode and hex_decode in encode.c
+--
+-- Build table for testing
+CREATE TABLE BYTEA_TABLE(data BYTEA);
+-- hex_decode is used for inserting into bytea column
+-- Set bytea_output to hex so that hex_encode is used and tested
+SET bytea_output = 'hex';
+INSERT INTO BYTEA_TABLE VALUES ('\xAB');
+INSERT INTO BYTEA_TABLE VALUES ('\x01ab');
+INSERT INTO BYTEA_TABLE VALUES ('\xDEADC0DE');
+INSERT INTO BYTEA_TABLE VALUES ('\xbaadf00d');
+INSERT INTO BYTEA_TABLE VALUES ('\x C001   c0ffee  '); -- hex string with whitespaces
+-- errors checking
+INSERT INTO BYTEA_TABLE VALUES ('\xbadf00d'); -- odd number of hex digits
+ERROR:  invalid hexadecimal data: odd number of digits
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xbadf00d');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xdeadcode'); -- invalid hexadecimal digit: "o"
+ERROR:  invalid hexadecimal digit: "o"
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xdeadcode');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC0FFEE'); -- invalid hexadecimal digit: "L"
+ERROR:  invalid hexadecimal digit: "L"
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xC00LC0FFEE');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC*DE'); -- invalid hexadecimal digit: "*"
+ERROR:  invalid hexadecimal digit: "L"
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xC00LC*DE');
+                                        ^
+INSERT INTO BYTEA_TABLE VALUES ('\xbad f00d'); -- invalid hexadecimal digit: " "
+ERROR:  invalid hexadecimal digit: " "
+LINE 1: INSERT INTO BYTEA_TABLE VALUES ('\xbad f00d');
+                                        ^
+-- long hex strings to test SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || repeat('baadf00d', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || '   ' || repeat('baad f00d', 8))::bytea; -- hex string with whitespaces
+-- errors checking for SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'badf00d' || repeat('DEADC0DE', 4))::bytea; -- odd number of hex digits
+ERROR:  invalid hexadecimal data: odd number of digits
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'baadfood'|| repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "o"
+ERROR:  invalid hexadecimal digit: "o"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'C00LC0FFEE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "L"
+ERROR:  invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'C00LC*DE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "*"
+ERROR:  invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'bad f00d' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: " "
+ERROR:  invalid hexadecimal digit: " "
+SELECT encode(data, 'hex') FROM BYTEA_TABLE;
+                                                              encode                                                              
+----------------------------------------------------------------------------------------------------------------------------------
+ ab
+ 01ab
+ deadc0de
+ baadf00d
+ c001c0ffee
+ deadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0de
+ deadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0debaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00d
+ deadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0dedeadc0debaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00dbaadf00d
+(8 rows)
+
+DROP TABLE BYTEA_TABLE;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index fbffc67ae60..876a3988ed0 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -109,6 +109,11 @@ test: select_views portals_p2 foreign_key cluster dependency guc bitmapops combo
 # ----------
 test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath sqljson sqljson_queryfuncs sqljson_jsontable
 
+# ----------
+# Another group of parallel tests for hex encode/decode
+# ----------
+test: hex_coding
+
 # ----------
 # Another group of parallel tests
 # with depends on create_misc
diff --git a/src/test/regress/sql/hex_coding.sql b/src/test/regress/sql/hex_coding.sql
new file mode 100644
index 00000000000..97c51b62e90
--- /dev/null
+++ b/src/test/regress/sql/hex_coding.sql
@@ -0,0 +1,39 @@
+--
+-- tests for hex_encode and hex_decode in encode.c
+--
+
+-- Build table for testing
+CREATE TABLE BYTEA_TABLE(data BYTEA);
+
+-- hex_decode is used for inserting into bytea column
+-- Set bytea_output to hex so that hex_encode is used and tested
+SET bytea_output = 'hex';
+
+INSERT INTO BYTEA_TABLE VALUES ('\xAB');
+INSERT INTO BYTEA_TABLE VALUES ('\x01ab');
+INSERT INTO BYTEA_TABLE VALUES ('\xDEADC0DE');
+INSERT INTO BYTEA_TABLE VALUES ('\xbaadf00d');
+INSERT INTO BYTEA_TABLE VALUES ('\x C001   c0ffee  '); -- hex string with whitespaces
+
+-- errors checking
+INSERT INTO BYTEA_TABLE VALUES ('\xbadf00d'); -- odd number of hex digits
+INSERT INTO BYTEA_TABLE VALUES ('\xdeadcode'); -- invalid hexadecimal digit: "o"
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC0FFEE'); -- invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE VALUES ('\xC00LC*DE'); -- invalid hexadecimal digit: "*"
+INSERT INTO BYTEA_TABLE VALUES ('\xbad f00d'); -- invalid hexadecimal digit: " "
+
+-- long hex strings to test SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || repeat('baadf00d', 8))::bytea;
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || '   ' || repeat('baad f00d', 8))::bytea; -- hex string with whitespaces
+
+-- errors checking for SIMD implementation
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'badf00d' || repeat('DEADC0DE', 4))::bytea; -- odd number of hex digits
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'baadfood'|| repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "o"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 4) || 'C00LC0FFEE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "L"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'C00LC*DE' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: "*"
+INSERT INTO BYTEA_TABLE SELECT ('\x' || repeat('DEADC0DE', 8) || 'bad f00d' || repeat('DEADC0DE', 4))::bytea; -- invalid hexadecimal digit: " "
+
+SELECT encode(data, 'hex') FROM BYTEA_TABLE;
+
+DROP TABLE BYTEA_TABLE;
-- 
2.34.1

Attachments:

bytea_read_hex_encode_sve.svg (image/svg+xml)
bytea_read_hex_encode_v17.svg (image/svg+xml)
bytea_read_hex_encode_v18.svg (image/svg+xml)
bytea_write_hex_decode_sve.svg (image/svg+xml)
bytea_write_hex_decode_v18.svg (image/svg+xml)
#32Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#31)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Thu, Sep 04, 2025 at 02:55:50PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

I see that there was some discussion about a Neon implementation upthread,
but I'm not sure we concluded anything. For popcount, we first added a
Neon version before adding the SVE version, which required more complicated
configure/runtime checks. Presumably Neon is available on more hardware
than SVE, so that could be a good place to start here, too.

We have added the Neon versions of hex encode/decode.

Thanks. I noticed that this stuff is simple enough that we can use
port/simd.h (with a few added functions). This is especially nice because
it takes care of x86, too. The performance gains look similar to what you
reported for v6:

arm
buf | HEAD | patch | % diff
-------+-------+-------+--------
16 | 13 | 6 | 54
64 | 34 | 9 | 74
256 | 93 | 25 | 73
1024 | 281 | 78 | 72
4096 | 1086 | 227 | 79
16384 | 4382 | 927 | 79
65536 | 17455 | 3608 | 79

x86
buf | HEAD | patch | % diff
-------+-------+-------+--------
16 | 10 | 7 | 30
64 | 29 | 9 | 69
256 | 81 | 21 | 74
1024 | 286 | 66 | 77
4096 | 1106 | 253 | 77
16384 | 4383 | 980 | 78
65536 | 17491 | 3886 | 78

I've only modified hex_encode() for now, but I'm optimistic that we can do
something similar for hex_decode().

--
nathan

Attachments:

v7-0001-Optimize-hex_encode-using-SIMD.patch (text/plain; charset=us-ascii)
From f2b4f8cf844dead4658469257b771d3394a46ed0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 10 Sep 2025 21:37:20 -0500
Subject: [PATCH v7 1/1] Optimize hex_encode() using SIMD.

---
 src/backend/utils/adt/encode.c |  56 +++++++++++++++-
 src/include/port/simd.h        | 118 +++++++++++++++++++++++++++++++++
 2 files changed, 172 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d1..0372d0e787a 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,57 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This works by splitting the high and low nibbles of each byte into
+	 * separate vectors, adding the vectors to a mask that converts the
+	 * nibbles to their equivalent ASCII bytes, and interleaving those bytes
+	 * back together to form the final hex-encoded string.  It might be
+	 * possible to squeeze out a little more gain by manually unrolling the
+	 * loop, but for now we don't bother.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector32_shift_right_nibble((Vector32) hi);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..f1d5353d2b3 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -70,6 +70,7 @@ static inline void vector32_load(Vector32 *v, const uint32 *s);
 static inline Vector8 vector8_broadcast(const uint8 c);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_broadcast(const uint32 c);
+static inline void vector8_store(uint8 *s, Vector8 v);
 #endif
 
 /* element-wise comparisons to a scalar */
@@ -86,6 +87,8 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
+static inline Vector8 vector8_and(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_add(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 #endif
 
@@ -99,6 +102,14 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
+static inline Vector8 vector8_gt(const Vector8 v1, const Vector8 v2);
+#endif
+
+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
 #endif
 
 /*
@@ -128,6 +139,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -358,6 +384,36 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return the result of subtracting the respective elements of the input
  * vectors using saturation (i.e., if the operation would yield a value less
@@ -404,6 +460,23 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined (USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -419,4 +492,49 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_right_nibble(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return vshrq_n_u32(v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
-- 
2.39.5 (Apple Git-154)

#33Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Nathan Bossart (#32)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Thanks. I noticed that this stuff is simple enough that we can use
port/simd.h (with a few added functions). This is especially nice because
it takes care of x86, too. The performance gains look similar to what you
reported for v6:

This looks good, much cleaner.
One possible improvement would be to use a vectorized table lookup instead of compare and add. I compared v6 and v7 Neon versions, and v6 is always faster.
I’m not sure if SSE2 has a table lookup similar to Neon.

arm - m7g.4xlarge
buf | v6-Neon| v7-Neon| % diff
-------+--------+--------+--------
64 | 6.16 | 8.57 | 28.07
128 | 11.37 | 15.77 | 27.87
256 | 18.54 | 30.28 | 38.77
512 | 33.98 | 62.15 | 45.33
1024 | 64.46 | 117.55 | 45.16
2048 | 124.28 | 254.86 | 51.24
4096 | 243.47 | 509.23 | 52.19
8192 | 487.34 | 953.81 | 48.91

-----
Chiranmoy

#34Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#33)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Thu, Sep 11, 2025 at 10:43:56AM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

One possible improvement would be to use a vectorized table lookup
instead of compare and add. I compared v6 and v7 Neon versions, and v6 is
always faster. I’m not sure if SSE2 has a table lookup similar to Neon.

I'm not finding a simple way to do that kind of table lookup in SSE2. Part
of the reason v6 performs better is because you've unrolled the loop to
process 2 vector's worth of input data in each iteration. This trades
performance with smaller inputs for gains with larger ones. But even if I
do something similar for v7, v6 still wins most of the time.

My current philosophy with this stuff is to favor simplicity,
maintainability, portability, etc. over extracting the absolute maximum
amount of performance gain, so I think we should proceed with the simd.h
approach. But I'm curious how others feel about this.

v8 is an attempt to fix the casting error on MSVC.

--
nathan

Attachments:

v8-0001-Optimize-hex_encode-using-SIMD.patch (text/plain; charset=us-ascii)
From 746dc9e3d2673ce53ae0ddc46120f13c667a2817 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 10 Sep 2025 21:37:20 -0500
Subject: [PATCH v8 1/1] Optimize hex_encode() using SIMD.

---
 src/backend/utils/adt/encode.c |  56 +++++++++++++++-
 src/include/port/simd.h        | 118 +++++++++++++++++++++++++++++++++
 2 files changed, 172 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d1..62a37e961d4 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,57 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This works by splitting the high and low nibbles of each byte into
+	 * separate vectors, adding the vectors to a mask that converts the
+	 * nibbles to their equivalent ASCII bytes, and interleaving those bytes
+	 * back together to form the final hex-encoded string.  It might be
+	 * possible to squeeze out a little more gain by manually unrolling the
+	 * loop, but for now we don't bother.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector32_shift_right_nibble(hi);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..f1d5353d2b3 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -70,6 +70,7 @@ static inline void vector32_load(Vector32 *v, const uint32 *s);
 static inline Vector8 vector8_broadcast(const uint8 c);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_broadcast(const uint32 c);
+static inline void vector8_store(uint8 *s, Vector8 v);
 #endif
 
 /* element-wise comparisons to a scalar */
@@ -86,6 +87,8 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
+static inline Vector8 vector8_and(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_add(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 #endif
 
@@ -99,6 +102,14 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
+static inline Vector8 vector8_gt(const Vector8 v1, const Vector8 v2);
+#endif
+
+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
 #endif
 
 /*
@@ -128,6 +139,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -358,6 +384,36 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return the result of subtracting the respective elements of the input
  * vectors using saturation (i.e., if the operation would yield a value less
@@ -404,6 +460,23 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined (USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -419,4 +492,49 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_right_nibble(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return vshrq_n_u32(v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
-- 
2.39.5 (Apple Git-154)

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#34)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

Nathan Bossart <nathandbossart@gmail.com> writes:

My current philosophy with this stuff is to favor simplicity,
maintainability, portability, etc. over extracting the absolute maximum
amount of performance gain, so I think we should proceed with the simd.h
approach. But I'm curious how others feel about this.

+1. The maintainability aspect is critical over the long run.
Also, there's a very real danger of optimizing for the specific
hardware and test case you are working with, leading to actually
worse performance with future hardware.

regards, tom lane

#36Chiranmoy.Bhattacharya@fujitsu.com
Chiranmoy.Bhattacharya@fujitsu.com
In reply to: Tom Lane (#35)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

My current philosophy with this stuff is to favor simplicity,
maintainability, portability, etc. over extracting the absolute maximum
amount of performance gain, so I think we should proceed with the simd.h
approach. But I'm curious how others feel about this.

+1. The maintainability aspect is critical over the long run.
Also, there's a very real danger of optimizing for the specific
hardware and test case you are working with, leading to actually
worse performance with future hardware.

Using simd.h does make it easier to maintain.
Is there a plan to upgrade simd.h to use SSE4 or SSSE3 in the future?
Since SSE2 is much older, it lacks some of the more specialized intrinsics.
For example, vectorized table lookup can be implemented via pshufb [0], and
it’s available in SSSE3 and later x86 instruction sets.

[0]: https://www.felixcloutier.com/x86/pshufb

-----
Chiranmoy

#37Nathan Bossart
nathandbossart@gmail.com
In reply to: Chiranmoy.Bhattacharya@fujitsu.com (#36)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Sep 12, 2025 at 06:49:01PM +0000, Chiranmoy.Bhattacharya@fujitsu.com wrote:

Using simd.h does make it easier to maintain. Is there a plan to upgrade
simd.h to use SSE4 or SSSE3 in the future? Since SSE2 is much older, it
lacks some of the more specialized intrinsics. For example, vectorized
table lookup can be implemented via [0], and it’s available in SSSE3 and
later x86 instruction sets.

There have been a couple of discussions about the possibility of requiring
x86-64-v2 for Postgres, but I'm not aware of any serious efforts in that
area.

I've attached a new version of the patch with a simd.h version of
hex_decode(). Here are the numbers:

arm
buf | HEAD | patch | % diff
-------+-------+-------+--------
16 | 22 | 23 | -5
64 | 61 | 23 | 62
256 | 158 | 47 | 70
1024 | 542 | 122 | 77
4096 | 2103 | 429 | 80
16384 | 8548 | 1673 | 80
65536 | 34663 | 6738 | 81

x86
buf | HEAD | patch | % diff
-------+-------+-------+--------
16 | 13 | 14 | -8
64 | 42 | 15 | 64
256 | 126 | 42 | 67
1024 | 461 | 149 | 68
4096 | 1802 | 576 | 68
16384 | 7166 | 2280 | 68
65536 | 28625 | 9108 | 68

A couple of notes:

* For hex_decode(), we just give up on the SIMD path and fall back on the
scalar path as soon as we see anything outside [0-9A-Fa-f]. I suspect
this might introduce a regression for inputs of ~32 to ~64 bytes that
include whitespace (which must be skipped) or invalid characters, but I
don't know whether those inputs are common or whether we care.

* The code makes some assumptions about endianness that might not be true
everywhere, but I've yet to dig into this further.

--
nathan

Attachments:

v9-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch (text/plain; charset=us-ascii)
From 155535f92df1d9cdf97739465b26ba970ed97063 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 12 Sep 2025 15:51:55 -0500
Subject: [PATCH v9 1/1] Optimize hex_encode() and hex_decode() using SIMD.

---
 src/backend/utils/adt/encode.c | 142 ++++++++++++++++++++++-
 src/include/port/simd.h        | 201 +++++++++++++++++++++++++++++++++
 2 files changed, 339 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 4ccaed815d1..13883c27680 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,57 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This works by splitting the high and low nibbles of each byte into
+	 * separate vectors, adding the vectors to a mask that converts the
+	 * nibbles to their equivalent ASCII bytes, and interleaving those bytes
+	 * back together to form the final hex-encoded string.  It might be
+	 * possible to squeeze out a little more gain by manually unrolling the
+	 * loop, but for now we don't bother.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector32_shift_right_nibble(hi);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
@@ -213,8 +265,8 @@ hex_decode(const char *src, size_t len, char *dst)
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
-uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+static inline uint64
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +306,88 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+/*
+ * This helper converts each byte to its binary-equivalent nibble by
+ * subtraction.  Along the way, it verifies the input is within the expected
+ * range of ASCII values and returns false if not.  Finally, it combines the
+ * generated nibbles to form the return bytes, which will be separated by zero
+ * bytes in the destination vector.  If everything goes as planned, returns
+ * true.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
+{
+	Vector8		msk;
+	Vector8		sub;
+
+	msk = vector8_lt(src, vector8_broadcast('0'));
+	if (unlikely(vector8_is_highbit_set(msk)))
+		return false;
+
+	msk = vector8_lt(src, vector8_broadcast('A'));
+	sub = vector8_and(msk, vector8_broadcast('0'));
+	*dst = vector8_ssub(src, sub);
+
+	msk = vector8_and(*dst, msk);
+	msk = vector8_gt(msk, vector8_broadcast(0x9));
+	if (unlikely(vector8_is_highbit_set(msk)))
+		return false;
+
+	msk = vector8_gt(src, vector8_broadcast('a' - 1));
+	sub = vector8_and(msk, vector8_broadcast('a' - 10));
+	msk = vector8_xor(msk, vector8_gt(sub, vector8_broadcast('A' - 1)));
+	msk = vector8_and(msk, vector8_broadcast('A' - 10));
+	sub = vector8_or(sub, msk);
+	*dst = vector8_ssub(*dst, sub);
+
+	msk = vector8_gt(*dst, vector8_broadcast(0xf));
+	if (unlikely(vector8_is_highbit_set(msk)))
+		return false;
+
+	msk = vector8_and(*dst, vector32_broadcast(0xff00ff00));
+	msk = vector32_shift_right_byte(msk);
+	*dst = vector8_and(*dst, vector32_broadcast(0x00ff00ff));
+	*dst = vector32_shift_left_nibble(*dst);
+	*dst = vector8_or(*dst, msk);
+	return true;
+}
+#endif							/* ! USE_NO_SIMD */
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_NO_SIMD
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) * 2 - 1);
+	uint64		i;
+
+	/*
+	 * We must process 2 vectors at a time since the output will be half the
+	 * length of the input.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8) * 2)
+	{
+		Vector8		srcv;
+		Vector8		dstv1;
+		Vector8		dstv2;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+		if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
+			break;
+
+		vector8_load(&srcv, (const uint8 *) &src[i + sizeof(Vector8)]);
+		if (unlikely(!hex_decode_simd_helper(srcv, &dstv2)))
+			break;
+
+		vector8_store((uint8 *) &dst[i / 2], vector8_pack_16(dstv1, dstv2));
+	}
+
+	return i / 2 + hex_decode_safe_scalar(src + i, len - i, dst + i / 2, escontext);
+#endif
+}
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..0a9805e8ef1 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -70,6 +70,7 @@ static inline void vector32_load(Vector32 *v, const uint32 *s);
 static inline Vector8 vector8_broadcast(const uint8 c);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_broadcast(const uint32 c);
+static inline void vector8_store(uint8 *s, Vector8 v);
 #endif
 
 /* element-wise comparisons to a scalar */
@@ -86,6 +87,9 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
+static inline Vector8 vector8_xor(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_and(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_add(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 #endif
 
@@ -99,6 +103,18 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
+static inline Vector8 vector8_lt(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_gt(const Vector8 v1, const Vector8 v2);
+#endif
+
+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_byte(const Vector32 v1);
 #endif
 
 /*
@@ -128,6 +144,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -358,6 +389,51 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise XOR of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_xor(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_xor_si128(v1, v2);
+#elif defined(USE_NEON)
+	return veorq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return the result of subtracting the respective elements of the input
  * vectors using saturation (i.e., if the operation would yield a value less
@@ -404,6 +480,39 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is less than the
+ * corresponding lane of v2.  NB: The comparison treats the elements as signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_lt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmplt_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vcltq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -419,4 +528,96 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_pack_16(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_packus_epi16(v1, v2);
+#elif defined(USE_NEON)
+	return vuzp1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift left of each element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_left_nibble(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_slli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return vshlq_n_u32(v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_right_nibble(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return vshrq_n_u32(v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each element in the vector by 1 byte.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_right_byte(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 8);
+#elif defined(USE_NEON)
+	return vshrq_n_u32(v1, 8);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
-- 
2.39.5 (Apple Git-154)

#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#37)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Sep 12, 2025 at 04:30:21PM -0500, Nathan Bossart wrote:

I've attached a new version of the patch with a simd.h version of
hex_decode(). Here are the numbers:

I was able to improve the hex_decode() implementation a bit.

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |    11 |    11 |      0
    64 |    38 |     7 |     82
   256 |   133 |    18 |     86
  1024 |   513 |    67 |     87
  4096 |  2037 |   271 |     87
 16384 |  8326 |  1103 |     87
 65536 | 34550 |  4475 |     87

x86
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
    16 |     8 |     9 |    -13
    64 |    38 |     7 |     82
   256 |   121 |    24 |     80
  1024 |   457 |    91 |     80
  4096 |  1797 |   356 |     80
 16384 |  7161 |  1411 |     80
 65536 | 28620 |  5632 |     80
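For readers following along, the transform the patch's encode loop performs on each byte (described in the comment inside the attached patch) can be sketched in scalar C. This is an illustrative sketch with a hypothetical function name (`hex_encode_sketch`), not code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Scalar sketch of the per-byte transform the SIMD encode loop performs:
 * split each byte into high and low nibbles, then add '0' plus an extra
 * ('a' - '0' - 10) when the nibble exceeds 9, mirroring the
 * vector8_gt/vector8_and/vector8_add mask sequence in the patch.
 */
static void
hex_encode_sketch(const uint8_t *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++)
	{
		uint8_t		hi = src[i] >> 4;
		uint8_t		lo = src[i] & 0x0f;

		dst[i * 2] = (char) (hi + '0' + (hi > 9 ? 'a' - '0' - 10 : 0));
		dst[i * 2 + 1] = (char) (lo + '0' + (lo > 9 ? 'a' - '0' - 10 : 0));
	}
}
```

The SIMD version computes the same conditional offset branch-free: the greater-than comparison produces an all-bits lane mask, the AND turns it into the offset, and the ADD applies it to all lanes at once.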

--
nathan

Attachments:

v10-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch (text/plain)
From ba097b763eb80ab8c9c78503fcb5be5575342ff8 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 12 Sep 2025 15:51:55 -0500
Subject: [PATCH v10 1/1] Optimize hex_encode() and hex_decode() using SIMD.

---
 src/backend/utils/adt/encode.c | 131 ++++++++++++++++++-
 src/include/port/simd.h        | 227 ++++++++++++++++++++++++++++++++-
 2 files changed, 348 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 9a9c7e8da99..028e0ca6887 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,57 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This works by splitting the high and low nibbles of each byte into
+	 * separate vectors, adding the vectors to a mask that converts the
+	 * nibbles to their equivalent ASCII bytes, and interleaving those bytes
+	 * back together to form the final hex-encoded string.  It might be
+	 * possible to squeeze out a little more gain by manually unrolling the
+	 * loop, but for now we don't bother.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector32_shift_right_nibble(hi);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
@@ -213,8 +265,8 @@ hex_decode(const char *src, size_t len, char *dst)
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
-uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+static inline uint64
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +306,77 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+/*
+ * This helper converts each byte to its binary-equivalent nibble by
+ * subtraction and combines them to form the return bytes (separated by zero
+ * bytes).  Returns false if any input bytes are outside the expected ranges of
+ * ASCII values.  Otherwise, returns true.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
+{
+	Vector8		sub;
+	Vector8		msk;
+
+	msk = vector8_gt(vector8_broadcast('9' + 1), src);
+	sub = vector8_and(msk, vector8_broadcast('0'));
+
+	msk = vector8_gt(src, vector8_broadcast('A' - 1));
+	msk = vector8_and(msk, vector8_broadcast('A' - 10));
+	sub = vector8_add(sub, msk);
+
+	msk = vector8_gt(src, vector8_broadcast('a' - 1));
+	msk = vector8_and(msk, vector8_broadcast('a' - 'A'));
+	sub = vector8_add(sub, msk);
+
+	*dst = vector8_sssub(src, sub);
+	if (unlikely(vector8_has_ge(*dst, 0x10)))
+		return false;
+
+	msk = vector8_and(*dst, vector32_broadcast(0xff00ff00));
+	msk = vector32_shift_right_byte(msk);
+	*dst = vector8_and(*dst, vector32_broadcast(0x00ff00ff));
+	*dst = vector32_shift_left_nibble(*dst);
+	*dst = vector8_or(*dst, msk);
+	return true;
+}
+#endif							/* ! USE_NO_SIMD */
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_NO_SIMD
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) * 2 - 1);
+	uint64		i;
+
+	/*
+	 * We must process 2 vectors at a time since the output will be half the
+	 * length of the input.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8) * 2)
+	{
+		Vector8		srcv;
+		Vector8		dstv1;
+		Vector8		dstv2;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+		if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
+			break;
+
+		vector8_load(&srcv, (const uint8 *) &src[i + sizeof(Vector8)]);
+		if (unlikely(!hex_decode_simd_helper(srcv, &dstv2)))
+			break;
+
+		vector8_store((uint8 *) &dst[i / 2], vector8_pack_16(dstv1, dstv2));
+	}
+
+	return i / 2 + hex_decode_safe_scalar(src + i, len - i, dst + i / 2, escontext);
+#endif
+}
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..8059e37acc7 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -70,12 +70,16 @@ static inline void vector32_load(Vector32 *v, const uint32 *s);
 static inline Vector8 vector8_broadcast(const uint8 c);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_broadcast(const uint32 c);
+static inline void vector8_store(uint8 *s, Vector8 v);
 #endif
 
 /* element-wise comparisons to a scalar */
 static inline bool vector8_has(const Vector8 v, const uint8 c);
 static inline bool vector8_has_zero(const Vector8 v);
 static inline bool vector8_has_le(const Vector8 v, const uint8 c);
+#ifndef USE_NO_SIMD
+static inline bool vector8_has_ge(const Vector8 v, const uint8 c);
+#endif
 static inline bool vector8_is_highbit_set(const Vector8 v);
 #ifndef USE_NO_SIMD
 static inline bool vector32_is_highbit_set(const Vector32 v);
@@ -86,7 +90,10 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
-static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_and(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_add(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_sssub(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_ussub(const Vector8 v1, const Vector8 v2);
 #endif
 
 /*
@@ -99,6 +106,17 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
 static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
+static inline Vector8 vector8_gt(const Vector8 v1, const Vector8 v2);
+#endif
+
+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_byte(const Vector32 v1);
 #endif
 
 /*
@@ -128,6 +146,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -257,13 +290,32 @@ vector8_has_le(const Vector8 v, const uint8 c)
 	 * NUL bytes.  This approach is a workaround for the lack of unsigned
 	 * comparison instructions on some architectures.
 	 */
-	result = vector8_has_zero(vector8_ssub(v, vector8_broadcast(c)));
+	result = vector8_has_zero(vector8_ussub(v, vector8_broadcast(c)));
 #endif
 
 	Assert(assert_result == result);
 	return result;
 }
 
+/*
+ * Returns true if any elements in the vector are greater than or equal to the
+ * given scalar.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+	Vector8		umax = _mm_max_epu8(v, vector8_broadcast(c));
+	Vector8		cmpe = _mm_cmpeq_epi8(umax, v);
+
+	return vector8_is_highbit_set(cmpe);
+#elif defined(USE_NEON)
+	return vmaxvq_u8(v) >= c;
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return true if the high bit of any element is set
  */
@@ -358,15 +410,65 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return the result of subtracting the respective elements of the input
- * vectors using saturation (i.e., if the operation would yield a value less
- * than zero, zero is returned instead).  For more information on saturation
- * arithmetic, see https://en.wikipedia.org/wiki/Saturation_arithmetic
+ * vectors using signed saturation (i.e., if the operation would yield a value
+ * less than -128, -128 is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
  */
 #ifndef USE_NO_SIMD
 static inline Vector8
-vector8_ssub(const Vector8 v1, const Vector8 v2)
+vector8_sssub(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_subs_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vqsubq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of subtracting the respective elements of the input
+ * vectors using unsigned saturation (i.e., if the operation would yield a
+ * value less than zero, zero is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_ussub(const Vector8 v1, const Vector8 v2)
 {
 #ifdef USE_SSE2
 	return _mm_subs_epu8(v1, v2);
@@ -404,6 +506,23 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -419,4 +538,100 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_pack_16(const Vector8 v1, const Vector8 v2)
+{
+	Vector32	mask PG_USED_FOR_ASSERTS_ONLY = vector32_broadcast(0xff00ff00);
+
+	Assert(!vector8_has_ge(vector8_and(v1, mask), 1));
+	Assert(!vector8_has_ge(vector8_and(v2, mask), 1));
+#ifdef USE_SSE2
+	return _mm_packus_epi16(v1, v2);
+#elif defined(USE_NEON)
+	return vuzp1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift left of each element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_left_nibble(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_slli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return vshlq_n_u32(v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_right_nibble(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return vshrq_n_u32(v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each element in the vector by 1 byte.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector32
+vector32_shift_right_byte(const Vector32 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 8);
+#elif defined(USE_NEON)
+	return vshrq_n_u32(v1, 8);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
-- 
2.39.5 (Apple Git-154)

#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#38)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:

I was able to improve the hex_decode() implementation a bit.

I took a closer look at how hex_decode() performs with smaller inputs.
There are some small regressions, so I tried fixing them by adding the
following to the beginning of the function:

	if (likely(tail_idx == 0))
		return hex_decode_safe_scalar(src, len, dst, escontext);

This helped a little, but it mostly just slowed things down for larger
inputs on AArch64:

arm
   buf |  HEAD | patch |  fix
-------+-------+-------+-------
     2 |     4 |     6 |     4
     4 |     6 |     7 |     7
     8 |     8 |     8 |     8
    16 |    11 |    12 |    11
    32 |    18 |     5 |     6
    64 |    38 |     7 |     8
   256 |   134 |    18 |    24
  1024 |   514 |    67 |   100
  4096 |  2072 |   280 |   389
 16384 |  8409 |  1126 |  1537
 65536 | 34704 |  4498 |  6128

x86
   buf |  HEAD | patch |  fix
-------+-------+-------+-------
     2 |     2 |     2 |     2
     4 |     3 |     3 |     3
     8 |     4 |     4 |     4
    16 |     8 |     9 |     8
    32 |    23 |     5 |     5
    64 |    37 |     7 |     7
   256 |   122 |    24 |    24
  1024 |   457 |    91 |    92
  4096 |  1798 |   357 |   358
 16384 |  7161 |  1411 |  1416
 65536 | 28621 |  5630 |  5653

I didn't do this test for hex_encode(), but I'd expect it to follow a
similar pattern. I'm tempted to suggest that these regressions are within
tolerable levels and to forge on with v10. In any case, IMHO this patch is
approaching committable quality, so I'd be grateful for any feedback.

--
nathan

#40John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#39)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Sep 24, 2025 at 2:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Sep 22, 2025 at 03:05:44PM -0500, Nathan Bossart wrote:

I was able to improve the hex_decode() implementation a bit.

I took a closer look at how hex_decode() performs with smaller inputs.
There are some small regressions, so I tried fixing them by adding the
following to the beginning of the function:

	if (likely(tail_idx == 0))
		return hex_decode_safe_scalar(src, len, dst, escontext);

This helped a little, but it mostly just slowed things down for larger
inputs on AArch64:

I didn't do this test for hex_encode(), but I'd expect it to follow a
similar pattern. I'm tempted to suggest that these regressions are within
tolerable levels and to forge on with v10.

My first thought is, I'd hazard a guess that short byteas are much
less common than short strings.

My second thought is, the decode case is not that critical. From the
end-to-end tests above, the speed of the decode case had a relatively
small global effect compared to the encode case (Perhaps because reads
are cheaper than writes).

+	if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
+		break;

But if you really want to do something here, sprinkling "(un)likely"'s
here seems like solving the wrong problem (even if they make any
difference), since the early return is optimizing for exceptional
conditions. In other places (cf. the UTF8 string verifier), we
accumulate errors, and only if we have them at the end do we restart
from the beginning with the slow error-checking path that can show the
user the offending input.
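A minimal scalar sketch of that accumulate-then-recheck shape, with illustrative names rather than code from the patch: the fast loop never branches on bad input, it just ORs every looked-up value into an error accumulator, and only a nonzero high nibble at the end sends us to the slow error-reporting path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one hex digit; 0xFF flags an invalid character. */
static uint8_t
hex_digit(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return 0xFF;
}

/*
 * Fast path with no error branches inside the loop.  Invalid digits
 * poison the accumulator instead; valid digits are < 0x10, so any bit
 * in the high nibble of "err" means something was wrong.
 */
static bool
decode_accumulate(const char *src, size_t len, uint8_t *dst)
{
	uint8_t		err = 0;

	for (size_t i = 0; i + 2 <= len; i += 2)
	{
		uint8_t		hi = hex_digit(src[i]);
		uint8_t		lo = hex_digit(src[i + 1]);

		err |= hi | lo;
		dst[i / 2] = (uint8_t) ((hi << 4) | lo);
	}

	if ((err & 0xf0) == 0)
		return true;

	/*
	 * Slow path: a real implementation would rescan from the start with
	 * full error checking so it can point at the offending byte; here we
	 * simply report failure.
	 */
	return false;
}
```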

In any case, IMHO this patch is
approaching committable quality, so I'd be grateful for any feedback.

+vector8_sssub(const Vector8 v1, const Vector8 v2)

It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
"iss" for the signed case?

+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_byte(const Vector32 v1);

Do we need declarations for these? I recall that the existing
declarations are there for functions that are also used internally.

The nibble/byte things are rather specific. Wouldn't it be more
logical to expose the already-generic shift operations and let the
caller say by how much? Or does the compiler refuse because the
intrinsic doesn't get an immediate value? Some are like that, but I'm
not sure about these. If so, that's annoying and I wonder if there's a
workaround.

+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+	Vector8		umax = _mm_max_epu8(v, vector8_broadcast(c));
+	Vector8		cmpe = _mm_cmpeq_epi8(umax, v);
+
+	return vector8_is_highbit_set(cmpe);

We take pains to avoid signed comparison on unsigned input for the
"le" case, and I don't see why it's okay here.

Do the regression tests have long enough cases that test exceptional
paths, like invalid bytes and embedded whitespace? If not, we need
some.

--
John Naylor
Amazon Web Services

#41Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#40)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:

+	if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
+		break;

But if you really want to do something here, sprinkling "(un)likely"'s
here seems like solving the wrong problem (even if they make any
difference), since the early return is optimizing for exceptional
conditions. In other places (cf. the UTF8 string verifier), we
accumulate errors, and only if we have them at the end do we restart
from the beginning with the slow error-checking path that can show the
user the offending input.

I switched to an accumulator approach in v11.

+vector8_sssub(const Vector8 v1, const Vector8 v2)

It's hard to parse "sss", so maybe we can borrow an Intel-ism and use
"iss" for the signed case?

Done.

+/* vector manipulation */
+#ifndef USE_NO_SIMD
+static inline Vector8 vector8_interleave_low(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_interleave_high(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_pack_16(const Vector8 v1, const Vector8 v2);
+static inline Vector32 vector32_shift_left_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_nibble(const Vector32 v1);
+static inline Vector32 vector32_shift_right_byte(const Vector32 v1);

Do we need declarations for these? I recall that the existing
declarations are there for functions that are also used internally.

Removed.

The nibble/byte things are rather specific. Wouldn't it be more
logical to expose the already-generic shift operations and let the
caller say by how much? Or does the compiler refuse because the
intrinsic doesn't get an immediate value? Some are like that, but I'm
not sure about these. If so, that's annoying and I wonder if there's a
workaround.

Yeah, the compiler refuses unless the value is an integer literal. I
thought of using a switch statement to cover all the values used in-tree,
but I didn't like that, either.

+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+	Vector8		umax = _mm_max_epu8(v, vector8_broadcast(c));
+	Vector8		cmpe = _mm_cmpeq_epi8(umax, v);
+
+	return vector8_is_highbit_set(cmpe);

We take pains to avoid signed comparison on unsigned input for the
"le" case, and I don't see why it's okay here.

_mm_max_epu8() does unsigned comparisons, I think...
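To spell out the identity that SSE2 sequence relies on, here is a scalar model with a hypothetical name (`has_ge_sketch`), not patch code: for unsigned bytes, max(v, c) == v holds exactly when v >= c, so an unsigned max followed by compare-for-equality yields a greater-or-equal test without any signed comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar model of the SSE2 path in vector8_has_ge(): take the unsigned
 * max of each element with c (as _mm_max_epu8 does), then test for
 * equality with the original element (as _mm_cmpeq_epi8 does).  An
 * element survives the max unchanged iff it is >= c.
 */
static bool
has_ge_sketch(const uint8_t *v, size_t n, uint8_t c)
{
	bool		found = false;

	for (size_t i = 0; i < n; i++)
	{
		uint8_t		umax = v[i] > c ? v[i] : c;

		found |= (umax == v[i]);
	}
	return found;
}
```

Because the max is unsigned, elements with the high bit set (0x80 and above) are handled correctly, which is why no sign-flipping workaround is needed here.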

Do the regression tests have long enough cases that test exceptional
paths, like invalid bytes and embedded whitespace? If not, we need
some.

Added.

I've also fixed builds on gcc/arm64, as discussed elsewhere [0]. Here are
the current numbers on my laptop:

arm
   buf |  HEAD | patch | % diff
-------+-------+-------+--------
     2 |     4 |     4 |      0
     4 |     6 |     6 |      0
     8 |     8 |     8 |      0
    16 |    11 |    12 |     -9
    32 |    18 |     5 |     72
    64 |    38 |     6 |     84
   256 |   134 |    17 |     87
  1024 |   513 |    63 |     88
  4096 |  2081 |   262 |     87
 16384 |  8524 |  1058 |     88
 65536 | 34731 |  4224 |     88

[0]: /messages/by-id/aNQtN89VW8z-yo3B@nathan

--
nathan

Attachments:

v11-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch (text/plain)
From b914ad69199bfbb4af95b97f83568401bc42f319 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 12 Sep 2025 15:51:55 -0500
Subject: [PATCH v11 1/1] Optimize hex_encode() and hex_decode() using SIMD.

---
 src/backend/utils/adt/encode.c        | 137 +++++++++++++++-
 src/include/port/simd.h               | 223 +++++++++++++++++++++++++-
 src/test/regress/expected/strings.out |  58 +++++++
 src/test/regress/sql/strings.sql      |  16 ++
 4 files changed, 424 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 9a9c7e8da99..7ba92c2c481 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,57 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This works by splitting the high and low nibbles of each byte into
+	 * separate vectors, adding the vectors to a mask that converts the
+	 * nibbles to their equivalent ASCII bytes, and interleaving those bytes
+	 * back together to form the final hex-encoded string.  It might be
+	 * possible to squeeze out a little more gain by manually unrolling the
+	 * loop, but for now we don't bother.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector8_shift_right_nibble(hi);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
@@ -213,8 +265,8 @@ hex_decode(const char *src, size_t len, char *dst)
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
-uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+static inline uint64
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +306,83 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+/*
+ * This helper converts each byte to its binary-equivalent nibble by
+ * subtraction and combines them to form the return bytes (separated by zero
+ * bytes).  Returns false if any input bytes are outside the expected ranges of
+ * ASCII values.  Otherwise, returns true.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
+{
+	Vector8		sub;
+	Vector8		msk;
+	bool		ret;
+
+	msk = vector8_gt(vector8_broadcast('9' + 1), src);
+	sub = vector8_and(msk, vector8_broadcast('0'));
+
+	msk = vector8_gt(src, vector8_broadcast('A' - 1));
+	msk = vector8_and(msk, vector8_broadcast('A' - 10));
+	sub = vector8_add(sub, msk);
+
+	msk = vector8_gt(src, vector8_broadcast('a' - 1));
+	msk = vector8_and(msk, vector8_broadcast('a' - 'A'));
+	sub = vector8_add(sub, msk);
+
+	*dst = vector8_issub(src, sub);
+	ret = !vector8_has_ge(*dst, 0x10);
+
+	msk = vector8_and(*dst, vector8_broadcast_u32(0xff00ff00));
+	msk = vector8_shift_right_byte(msk);
+	*dst = vector8_and(*dst, vector8_broadcast_u32(0x00ff00ff));
+	*dst = vector8_shift_left_nibble(*dst);
+	*dst = vector8_or(*dst, msk);
+	return ret;
+}
+#endif							/* ! USE_NO_SIMD */
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_NO_SIMD
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) * 2 - 1);
+	uint64		i;
+	bool		success = true;
+
+	/*
+	 * We must process 2 vectors at a time since the output will be half the
+	 * length of the input.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8) * 2)
+	{
+		Vector8		srcv;
+		Vector8		dstv1;
+		Vector8		dstv2;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+		success &= hex_decode_simd_helper(srcv, &dstv1);
+
+		vector8_load(&srcv, (const uint8 *) &src[i + sizeof(Vector8)]);
+		success &= hex_decode_simd_helper(srcv, &dstv2);
+
+		vector8_store((uint8 *) &dst[i / 2], vector8_pack_16(dstv1, dstv2));
+	}
+
+	/*
+	 * If something didn't look right in the vector path, try again in the
+	 * scalar path so that we can handle it correctly.
+	 */
+	if (unlikely(!success))
+		i = 0;
+
+	return i / 2 + hex_decode_safe_scalar(src + i, len - i, dst + i / 2, escontext);
+#endif
+}
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..531d8b8b6d1 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -86,7 +86,7 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
-static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_ussub(const Vector8 v1, const Vector8 v2);
 #endif
 
 /*
@@ -128,6 +128,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -155,6 +170,24 @@ vector32_broadcast(const uint32 c)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types.  This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+	return vector32_broadcast(c);
+#elif defined(USE_NEON)
+	return (Vector8) vector32_broadcast(c);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return true if any elements in the vector are equal to the given scalar.
  */
@@ -257,13 +290,32 @@ vector8_has_le(const Vector8 v, const uint8 c)
 	 * NUL bytes.  This approach is a workaround for the lack of unsigned
 	 * comparison instructions on some architectures.
 	 */
-	result = vector8_has_zero(vector8_ssub(v, vector8_broadcast(c)));
+	result = vector8_has_zero(vector8_ussub(v, vector8_broadcast(c)));
 #endif
 
 	Assert(assert_result == result);
 	return result;
 }
 
+/*
+ * Returns true if any elements in the vector are greater than or equal to the
+ * given scalar.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+	Vector8		umax = _mm_max_epu8(v, vector8_broadcast(c));
+	Vector8		cmpe = _mm_cmpeq_epi8(umax, v);
+
+	return vector8_is_highbit_set(cmpe);
+#elif defined(USE_NEON)
+	return vmaxvq_u8(v) >= c;
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return true if the high bit of any element is set
  */
@@ -358,15 +410,65 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of subtracting the respective elements of the input
+ * vectors using signed saturation (i.e., if the operation would yield a value
+ * less than -128, -128 is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_issub(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_subs_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return (Vector8) vqsubq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return the result of subtracting the respective elements of the input
- * vectors using saturation (i.e., if the operation would yield a value less
- * than zero, zero is returned instead).  For more information on saturation
- * arithmetic, see https://en.wikipedia.org/wiki/Saturation_arithmetic
+ * vectors using unsigned saturation (i.e., if the operation would yield a
+ * value less than zero, zero is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
  */
 #ifndef USE_NO_SIMD
 static inline Vector8
-vector8_ssub(const Vector8 v1, const Vector8 v2)
+vector8_ussub(const Vector8 v1, const Vector8 v2)
 {
 #ifdef USE_SSE2
 	return _mm_subs_epu8(v1, v2);
@@ -404,6 +506,23 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -419,4 +538,96 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_pack_16(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_packus_epi16(v1, v2);
+#elif defined(USE_NEON)
+	return vuzp1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift left of each 32-bit element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_left_nibble(const Vector8 v1)
+{
+#ifdef USE_SSE2
+	return _mm_slli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return (Vector8) vshlq_n_u32((Vector32) v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each 32-bit element in the vector by 4 bits.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_right_nibble(const Vector8 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 4);
+#elif defined(USE_NEON)
+	return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each 32-bit element in the vector by 1 byte.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_right_byte(const Vector8 v1)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, 8);
+#elif defined(USE_NEON)
+	return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 691e475bce3..3e351cf1cd0 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -260,6 +260,64 @@ SELECT reverse('\xabcd'::bytea);
  \xcdab
 (1 row)
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+ bytea 
+-------
+ \x
+(1 row)
+
+SELECT ('\x' || repeat('!', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "!"
+SELECT ('\x' || repeat('/', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "/"
+SELECT ('\x' || repeat('0', 32))::bytea;
+               bytea                
+------------------------------------
+ \x00000000000000000000000000000000
+(1 row)
+
+SELECT ('\x' || repeat('9', 32))::bytea;
+               bytea                
+------------------------------------
+ \x99999999999999999999999999999999
+(1 row)
+
+SELECT ('\x' || repeat(':', 33))::bytea;
+ERROR:  invalid hexadecimal digit: ":"
+SELECT ('\x' || repeat('@', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "@"
+SELECT ('\x' || repeat('A', 32))::bytea;
+               bytea                
+------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('F', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('G', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "G"
+SELECT ('\x' || repeat('`', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "`"
+SELECT ('\x' || repeat('a', 32))::bytea;
+               bytea                
+------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('f', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('g', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "g"
+SELECT ('\x' || repeat('~', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "~"
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
       bytea       
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index c05f3413699..35910369b6f 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -82,6 +82,22 @@ SELECT reverse(''::bytea);
 SELECT reverse('\xaa'::bytea);
 SELECT reverse('\xabcd'::bytea);
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+SELECT ('\x' || repeat('!', 33))::bytea;
+SELECT ('\x' || repeat('/', 33))::bytea;
+SELECT ('\x' || repeat('0', 32))::bytea;
+SELECT ('\x' || repeat('9', 32))::bytea;
+SELECT ('\x' || repeat(':', 33))::bytea;
+SELECT ('\x' || repeat('@', 33))::bytea;
+SELECT ('\x' || repeat('A', 32))::bytea;
+SELECT ('\x' || repeat('F', 32))::bytea;
+SELECT ('\x' || repeat('G', 33))::bytea;
+SELECT ('\x' || repeat('`', 33))::bytea;
+SELECT ('\x' || repeat('a', 32))::bytea;
+SELECT ('\x' || repeat('f', 32))::bytea;
+SELECT ('\x' || repeat('g', 33))::bytea;
+SELECT ('\x' || repeat('~', 33))::bytea;
+
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
 SELECT E'\\x De Ad Be Ef '::bytea;
-- 
2.39.5 (Apple Git-154)
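As an aside for readers following the v11 patch above: the per-lane arithmetic in its encode loop (mask out a nibble, add 'a' - '0' - 10 when the nibble exceeds 9, then add '0') reduces to the scalar sketch below. The names are invented for illustration; this is not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scalar mirror of the patch's per-lane encode math (illustrative only). */
static char
nibble_to_hex(unsigned char n)          /* n must be 0..15 */
{
	/* Same trick as the vector code: the conditional mask bridges the
	 * ASCII gap between '9' and 'a' for nibbles greater than 9. */
	unsigned char mask = (n > 0x9) ? ('a' - '0' - 10) : 0;

	return (char) (n + mask + '0');
}

/* Hypothetical scalar analogue of the vectorized hex_encode loop. */
static void
hex_encode_sketch(const unsigned char *src, size_t len, char *dst)
{
	for (size_t i = 0; i < len; i++)
	{
		dst[i * 2] = nibble_to_hex(src[i] >> 4);        /* high nibble */
		dst[i * 2 + 1] = nibble_to_hex(src[i] & 0x0f);  /* low nibble */
	}
}
```

The vector version performs the same computation on 16 lanes at once and interleaves the high- and low-nibble results with unpack/zip instructions.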

#42 John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#41)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Thu, Sep 25, 2025 at 4:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

> On Wed, Sep 24, 2025 at 10:59:38AM +0700, John Naylor wrote:
>> + if (unlikely(!hex_decode_simd_helper(srcv, &dstv1)))
>> + break;
>>
>> But if you really want to do something here, sprinkling "(un)likely"'s
>> here seems like solving the wrong problem (even if they make any
>> difference), since the early return is optimizing for exceptional
>> conditions. In other places (cf. the UTF8 string verifier), we
>> accumulate errors, and only if we have them at the end do we restart
>> from the beginning with the slow error-checking path that can show the
>> user the offending input.
>
> I switched to an accumulator approach in v11.

Looks good to me.

+ if (unlikely(!success))
+ i = 0;

This is after the main loop exits, and the cold path is literally one
instruction, so the motivation is not apparent to me.

>> The nibble/byte things are rather specific. Wouldn't it be more
>> logical to expose the already-generic shift operations and let the
>> caller say by how much? Or does the compiler refuse because the
>> intrinsic doesn't get an immediate value? Some are like that, but I'm
>> not sure about these. If so, that's annoying and I wonder if there's a
>> workaround.
>
> Yeah, the compiler refuses unless the value is an integer literal. I
> thought of using a switch statement to cover all the values used in-tree,
> but I didn't like that, either.

Neither option is great, but I mildly lean towards keeping it internal
with "switch" or whatever: By putting the burden of specifying shift
amounts on separately named functions we run a risk of combinatorial
explosion in function names.

If you feel otherwise, I'd at least use actual numbers:
"shift_left_nibble" is an awkward way to say "shift left by 4 bits"
anyway, and also after "byte" and "nibble" there are not many good
English words to convey the operand amount. It's very possible that
needing other shift amounts will never come up, though.
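A minimal sketch of the switch-based dispatch under discussion, assuming SSE2 and using invented names (the in-tree function would live in simd.h and also cover Neon):

```c
#include <assert.h>
#ifdef __SSE2__
#include <emmintrin.h>

/* Some shift intrinsics require an immediate (compile-time) count, so a
 * runtime count has to be dispatched through a switch over the literals
 * actually used.  Sketch only; names are invented. */
static inline __m128i
vector32_shift_right_sketch(__m128i v, int n)
{
	switch (n)
	{
		case 4:
			return _mm_srli_epi32(v, 4);
		case 8:
			return _mm_srli_epi32(v, 8);
		default:
			return v;           /* unsupported count: caller error */
	}
}
#endif
```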

>> +vector8_has_ge(const Vector8 v, const uint8 c)
>> +{
>> +#ifdef USE_SSE2
>> + Vector8 umax = _mm_max_epu8(v, vector8_broadcast(c));
>> + Vector8 cmpe = _mm_cmpeq_epi8(umax, v);
>> +
>> + return vector8_is_highbit_set(cmpe);
>>
>> We take pains to avoid signed comparison on unsigned input for the
>> "le" case, and I don't see why it's okay here.
>
> _mm_max_epu8() does unsigned comparisons, I think...

Ah, I confused myself about what the LE case was avoiding, namely
signed LE, not signed equality on something else.

(Separately, now I'm wondering if we can do the same for
vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
would allow getting rid of )

>> Do the regression tests have long enough cases that test exceptional
>> paths, like invalid bytes and embedded whitespace? If not, we need
>> some.
>
> Added.

Seems comprehensive enough at a glance.

Other comments:

+ * back together to form the final hex-encoded string.  It might be
+ * possible to squeeze out a little more gain by manually unrolling the
+ * loop, but for now we don't bother.

My position (and I think the community agrees) is that manual
unrolling is a rare desperation move that has to be justified, so we
don't need to mention its lack.

+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types.  This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+ return vector32_broadcast(c);
+#elif defined(USE_NEON)
+ return (Vector8) vector32_broadcast(c);
+#endif
+}

I'm ambivalent about this: The use case doesn't seem well motivated,
since I don't know why we'd actually need to both broadcast arbitrary
integers and also view the result as bytes. Setting arbitrary bytes is
what we're really doing, and would be more likely be useful in the
future (attached, only tested on x86, and I think part of the
strangeness is the endianness you mentioned above). On the other hand,
the Arm workaround results in awful generated code compared to what
you have here. Since the "set" should be hoisted out of the outer
loop, and we already rely on this pattern for vector8_highbit_mask
anyway, it might be tolerable, and we can reduce the pain with bitwise
NOT.

+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */

v10 asserted this requirement -- that still seems like a good thing?

--
John Naylor
Amazon Web Services

Attachments:

vector8_set.patch.txt (text/plain; charset=US-ASCII)
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 7ba92c2c481..eb932d4bd8a 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -317,6 +317,9 @@ static inline bool
 hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
 {
 	Vector8		sub;
+	// TODO: set one and use bitwise NOT for the other
+	Vector8		maskupper = vector8_set(0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0);
+	Vector8		masklower = vector8_set(0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff, 0, 0xff);
 	Vector8		msk;
 	bool		ret;
 
@@ -334,9 +337,9 @@ hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
 	*dst = vector8_issub(src, sub);
 	ret = !vector8_has_ge(*dst, 0x10);
 
-	msk = vector8_and(*dst, vector8_broadcast_u32(0xff00ff00));
+	msk = vector8_and(*dst, masklower);
 	msk = vector8_shift_right_byte(msk);
-	*dst = vector8_and(*dst, vector8_broadcast_u32(0x00ff00ff));
+	*dst = vector8_and(*dst, maskupper);
 	*dst = vector8_shift_left_nibble(*dst);
 	*dst = vector8_or(*dst, msk);
 	return ret;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 531d8b8b6d1..56e810ce081 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -101,6 +101,33 @@ static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 #endif
 
+/*
+ * Populate a vector element-wise with the arguments.
+ */
+#ifndef USE_NO_SIMD
+#if defined(USE_NEON)
+// from a patch by Thomas Munro
+static inline Vector8
+vector8_set(uint8 v0, uint8 v1, uint8 v2, uint8 v3,
+	 uint8 v4, uint8 v5, uint8 v6, uint8 v7,
+	 uint8 v8, uint8 v9, uint8 v10, uint8 v11,
+	 uint8 v12, uint8 v13, uint8 v14, uint8 v15)
+{
+	uint8 pg_attribute_aligned(16) values[16] = {
+		v0, v1, v2, v3,
+		v4, v5, v6, v7,
+		v8, v9, v10, v11,
+		v12, v13, v14, v15
+	};
+	return vld1q_u8(values);
+}
+#elif defined(USE_SSE2)
+#ifndef vector8_set
+#define vector8_set(...)		_mm_setr_epi8(__VA_ARGS__)
+#endif
+#endif
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Load a chunk of memory into the given vector.
  */
@@ -368,6 +395,7 @@ vector8_highbit_mask(const Vector8 v)
 	 * returns a uint64, making it inconvenient to combine mask values from
 	 * multiple vectors.
 	 */
+	// TODO: use vector8_set
 	static const uint8 mask[16] = {
 		1 << 0, 1 << 1, 1 << 2, 1 << 3,
 		1 << 4, 1 << 5, 1 << 6, 1 << 7,
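For context on hex_decode_simd_helper (present in both patch versions above): its branch-free range classification can be checked against a scalar equivalent. The names below are invented for illustration and are not part of either patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Scalar mirror of the decode helper's subtraction trick (illustrative).
 * Each comparison adds a component to the subtrahend; valid hex digits end
 * up in 0..15, and everything else falls outside that range. */
static bool
hex_digit_value(unsigned char c, int *value)
{
	int			sub = 0;

	if (c < '9' + 1)            /* digits: subtract '0' */
		sub += '0';
	if (c > 'A' - 1)            /* uppercase: subtract 'A' - 10 */
		sub += 'A' - 10;
	if (c > 'a' - 1)            /* lowercase: additionally 'a' - 'A' */
		sub += 'a' - 'A';

	*value = (int) c - sub;

	/* The vector code uses saturating subtraction plus an unsigned >= 0x10
	 * test; for a single scalar byte that amounts to this range check. */
	return *value >= 0 && *value < 0x10;
}
```

Bytes in the gaps between '9' and 'A', between 'F' and 'a', or past 'f' land outside 0..15, which is why the vector path can simply flag failure and let the scalar path handle whitespace and error reporting.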
#43 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#42)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:

> + if (unlikely(!success))
> + i = 0;
>
> This is after the main loop exits, and the cold path is literally one
> instruction, so the motivation is not apparent to me.

Removed. I was thinking about smaller inputs when I added this, but it
probably makes little difference.

>> Yeah, the compiler refuses unless the value is an integer literal. I
>> thought of using a switch statement to cover all the values used in-tree,
>> but I didn't like that, either.
>
> Neither option is great, but I mildly lean towards keeping it internal
> with "switch" or whatever: By putting the burden of specifying shift
> amounts on separately named functions we run a risk of combinatorial
> explosion in function names.

Done.

> (Separately, now I'm wondering if we can do the same for
> vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
> would allow getting rid of )

I think so. I doubt there's any performance advantage, but it could be
nice for code cleanup. (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.") I'll
do this in the related patch in the "couple of small patches for simd.h"
thread.

> + * back together to form the final hex-encoded string.  It might be
> + * possible to squeeze out a little more gain by manually unrolling the
> + * loop, but for now we don't bother.
>
> My position (and I think the community agrees) is that manual
> unrolling is a rare desperation move that has to be justified, so we
> don't need to mention its lack.

Removed.

> + * Some compilers are picky about casts to the same underlying type, and others
> + * are picky about implicit conversions with vector types.  This function does
> + * the same thing as vector32_broadcast(), but it returns a Vector8 and is
> + * carefully crafted to avoid compiler indigestion.
> + */
> +#ifndef USE_NO_SIMD
> +static inline Vector8
> +vector8_broadcast_u32(const uint32 c)
> +{
> +#ifdef USE_SSE2
> + return vector32_broadcast(c);
> +#elif defined(USE_NEON)
> + return (Vector8) vector32_broadcast(c);
> +#endif
> +}
>
> I'm ambivalent about this: The use case doesn't seem well motivated,
> since I don't know why we'd actually need to both broadcast arbitrary
> integers and also view the result as bytes. Setting arbitrary bytes is
> what we're really doing, and would be more likely be useful in the
> future (attached, only tested on x86, and I think part of the
> strangeness is the endianness you mentioned above). On the other hand,
> the Arm workaround results in awful generated code compared to what
> you have here. Since the "set" should be hoisted out of the outer
> loop, and we already rely on this pattern for vector8_highbit_mask
> anyway, it might be tolerable, and we can reduce the pain with bitwise
> NOT.

I think I disagree on this one. We're not broadcasting arbitrary bytes for
every vector element, we're broadcasting a pattern of bytes that happens to
be wider than the element size. I would expect this to be a relatively
common use-case. Furthermore, the "set" API is closely tethered to the
vector size, which is fine for SSE2/Neon but may not work down the road
(not to mention the USE_NO_SIMD path). Also, the bitwise NOT approach
won't work because we need to use 0x0f000f00 and 0x000f000f to avoid
angering the assertion in vector8_pack_16(), as mentioned below.
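To illustrate the point about broadcasting a byte pattern wider than the element size, here is a scalar sketch with invented names (byte order within each 32-bit group follows the host's endianness):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar picture of vector8_broadcast_u32: the same four host-order bytes
 * repeat in every 32-bit lane of a 16-byte vector.  Illustrative only. */
static void
broadcast_u32_sketch(uint32_t c, uint8_t out[16])
{
	for (int lane = 0; lane < 4; lane++)
		memcpy(out + lane * 4, &c, sizeof(c));
}
```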

> +/*
> + * Pack 16-bit elements in the given vectors into a single vector of 8-bit
> + * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
> + * this will produce different results on different architectures.
> + */
>
> v10 asserted this requirement -- that still seems like a good thing?

I had removed that because I worried the accumulator approach would cause
it to fail (it does), but looking again, that's easy enough to work around.

--
nathan

Attachments:

v12-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch (text/plain; charset=us-ascii)
From a8b7563265fa231c68d778cd0589f29d3695f81d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 12 Sep 2025 15:51:55 -0500
Subject: [PATCH v12 1/1] Optimize hex_encode() and hex_decode() using SIMD.

---
 src/backend/utils/adt/encode.c        | 135 ++++++++++++++-
 src/include/port/simd.h               | 236 +++++++++++++++++++++++++-
 src/test/regress/expected/strings.out |  58 +++++++
 src/test/regress/sql/strings.sql      |  16 ++
 4 files changed, 435 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 9a9c7e8da99..87126a003b3 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,55 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This works by splitting the high and low nibbles of each byte into
+	 * separate vectors, adding the vectors to a mask that converts the
+	 * nibbles to their equivalent ASCII bytes, and interleaving those bytes
+	 * back together to form the final hex-encoded string.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector8_shift_right(hi, 4);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
@@ -213,8 +263,8 @@ hex_decode(const char *src, size_t len, char *dst)
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
-uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+static inline uint64
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +304,83 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+/*
+ * This helper converts each byte to its binary-equivalent nibble by
+ * subtraction and combines them to form the return bytes (separated by zero
+ * bytes).  Returns false if any input bytes are outside the expected ranges of
+ * ASCII values.  Otherwise, returns true.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
+{
+	Vector8		sub;
+	Vector8		msk;
+	bool		ret;
+
+	msk = vector8_gt(vector8_broadcast('9' + 1), src);
+	sub = vector8_and(msk, vector8_broadcast('0'));
+
+	msk = vector8_gt(src, vector8_broadcast('A' - 1));
+	msk = vector8_and(msk, vector8_broadcast('A' - 10));
+	sub = vector8_add(sub, msk);
+
+	msk = vector8_gt(src, vector8_broadcast('a' - 1));
+	msk = vector8_and(msk, vector8_broadcast('a' - 'A'));
+	sub = vector8_add(sub, msk);
+
+	*dst = vector8_issub(src, sub);
+	ret = !vector8_has_ge(*dst, 0x10);
+
+	msk = vector8_and(*dst, vector8_broadcast_u32(0x0f000f00));
+	msk = vector8_shift_right(msk, 8);
+	*dst = vector8_and(*dst, vector8_broadcast_u32(0x000f000f));
+	*dst = vector8_shift_left(*dst, 4);
+	*dst = vector8_or(*dst, msk);
+	return ret;
+}
+#endif							/* ! USE_NO_SIMD */
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_NO_SIMD
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) * 2 - 1);
+	uint64		i;
+	bool		success = true;
+
+	/*
+	 * We must process 2 vectors at a time since the output will be half the
+	 * length of the input.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8) * 2)
+	{
+		Vector8		srcv;
+		Vector8		dstv1;
+		Vector8		dstv2;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+		success &= hex_decode_simd_helper(srcv, &dstv1);
+
+		vector8_load(&srcv, (const uint8 *) &src[i + sizeof(Vector8)]);
+		success &= hex_decode_simd_helper(srcv, &dstv2);
+
+		vector8_store((uint8 *) &dst[i / 2], vector8_pack_16(dstv1, dstv2));
+	}
+
+	/*
+	 * If something didn't look right in the vector path, try again in the
+	 * scalar path so that we can handle it correctly.
+	 */
+	if (!success)
+		i = 0;
+
+	return i / 2 + hex_decode_safe_scalar(src + i, len - i, dst + i / 2, escontext);
+#endif
+}
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..0261179e9e7 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -86,7 +86,7 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
-static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_ussub(const Vector8 v1, const Vector8 v2);
 #endif
 
 /*
@@ -128,6 +128,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -155,6 +170,24 @@ vector32_broadcast(const uint32 c)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Some compilers are picky about casts to the same underlying type, and others
+ * are picky about implicit conversions with vector types.  This function does
+ * the same thing as vector32_broadcast(), but it returns a Vector8 and is
+ * carefully crafted to avoid compiler indigestion.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_broadcast_u32(const uint32 c)
+{
+#ifdef USE_SSE2
+	return vector32_broadcast(c);
+#elif defined(USE_NEON)
+	return (Vector8) vector32_broadcast(c);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return true if any elements in the vector are equal to the given scalar.
  */
@@ -257,13 +290,32 @@ vector8_has_le(const Vector8 v, const uint8 c)
 	 * NUL bytes.  This approach is a workaround for the lack of unsigned
 	 * comparison instructions on some architectures.
 	 */
-	result = vector8_has_zero(vector8_ssub(v, vector8_broadcast(c)));
+	result = vector8_has_zero(vector8_ussub(v, vector8_broadcast(c)));
 #endif
 
 	Assert(assert_result == result);
 	return result;
 }
 
+/*
+ * Returns true if any elements in the vector are greater than or equal to the
+ * given scalar.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+	Vector8		umax = _mm_max_epu8(v, vector8_broadcast(c));
+	Vector8		cmpe = _mm_cmpeq_epi8(umax, v);
+
+	return vector8_is_highbit_set(cmpe);
+#elif defined(USE_NEON)
+	return vmaxvq_u8(v) >= c;
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return true if the high bit of any element is set
  */
@@ -358,15 +410,65 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return the result of subtracting the respective elements of the input
- * vectors using saturation (i.e., if the operation would yield a value less
- * than zero, zero is returned instead).  For more information on saturation
- * arithmetic, see https://en.wikipedia.org/wiki/Saturation_arithmetic
+ * vectors using signed saturation (i.e., if the operation would yield a value
+ * less than -128, -128 is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
  */
 #ifndef USE_NO_SIMD
 static inline Vector8
-vector8_ssub(const Vector8 v1, const Vector8 v2)
+vector8_issub(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_subs_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return (Vector8) vqsubq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of subtracting the respective elements of the input
+ * vectors using unsigned saturation (i.e., if the operation would yield a
+ * value less than zero, zero is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_ussub(const Vector8 v1, const Vector8 v2)
 {
 #ifdef USE_SSE2
 	return _mm_subs_epu8(v1, v2);
@@ -404,6 +506,23 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -419,4 +538,109 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves of given vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  NB: The upper 8-bits of each 16-bit element must be zeros, else
+ * this will produce different results on different architectures.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_pack_16(const Vector8 v1, const Vector8 v2)
+{
+	Vector32	mask PG_USED_FOR_ASSERTS_ONLY = vector32_broadcast(0xff00ff00);
+
+	Assert(!vector8_has_ge(vector8_and(v1, mask), 1));
+	Assert(!vector8_has_ge(vector8_and(v2, mask), 1));
+#ifdef USE_SSE2
+	return _mm_packus_epi16(v1, v2);
+#elif defined(USE_NEON)
+	return vuzp1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift left of each 32-bit element in the vector by "i" bits.
+ *
+ * XXX AArch64 requires an integer literal, so we have to list all expected
+ * values of "i" from all callers in a switch statement.  If you add a new
+ * caller, be sure your expected values of "i" are handled.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_left(const Vector8 v1, int i)
+{
+#ifdef USE_SSE2
+	return _mm_slli_epi32(v1, i);
+#elif defined(USE_NEON)
+	switch (i)
+	{
+		case 4:
+			return (Vector8) vshlq_n_u32((Vector32) v1, 4);
+		default:
+			pg_unreachable();
+			return vector8_broadcast(0);
+	}
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each 32-bit element in the vector by "i" bits.
+ *
+ * XXX AArch64 requires an integer literal, so we have to list all expected
+ * values of "i" from all callers in a switch statement.  If you add a new
+ * caller, be sure your expected values of "i" are handled.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_right(const Vector8 v1, int i)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, i);
+#elif defined(USE_NEON)
+	switch (i)
+	{
+		case 4:
+			return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+		case 8:
+			return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+		default:
+			pg_unreachable();
+			return vector8_broadcast(0);
+	}
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 691e475bce3..3e351cf1cd0 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -260,6 +260,64 @@ SELECT reverse('\xabcd'::bytea);
  \xcdab
 (1 row)
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+ bytea 
+-------
+ \x
+(1 row)
+
+SELECT ('\x' || repeat('!', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "!"
+SELECT ('\x' || repeat('/', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "/"
+SELECT ('\x' || repeat('0', 32))::bytea;
+               bytea                
+------------------------------------
+ \x00000000000000000000000000000000
+(1 row)
+
+SELECT ('\x' || repeat('9', 32))::bytea;
+               bytea                
+------------------------------------
+ \x99999999999999999999999999999999
+(1 row)
+
+SELECT ('\x' || repeat(':', 33))::bytea;
+ERROR:  invalid hexadecimal digit: ":"
+SELECT ('\x' || repeat('@', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "@"
+SELECT ('\x' || repeat('A', 32))::bytea;
+               bytea                
+------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('F', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('G', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "G"
+SELECT ('\x' || repeat('`', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "`"
+SELECT ('\x' || repeat('a', 32))::bytea;
+               bytea                
+------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('f', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('g', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "g"
+SELECT ('\x' || repeat('~', 33))::bytea;
+ERROR:  invalid hexadecimal digit: "~"
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
       bytea       
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index c05f3413699..35910369b6f 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -82,6 +82,22 @@ SELECT reverse(''::bytea);
 SELECT reverse('\xaa'::bytea);
 SELECT reverse('\xabcd'::bytea);
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+SELECT ('\x' || repeat('!', 33))::bytea;
+SELECT ('\x' || repeat('/', 33))::bytea;
+SELECT ('\x' || repeat('0', 32))::bytea;
+SELECT ('\x' || repeat('9', 32))::bytea;
+SELECT ('\x' || repeat(':', 33))::bytea;
+SELECT ('\x' || repeat('@', 33))::bytea;
+SELECT ('\x' || repeat('A', 32))::bytea;
+SELECT ('\x' || repeat('F', 32))::bytea;
+SELECT ('\x' || repeat('G', 33))::bytea;
+SELECT ('\x' || repeat('`', 33))::bytea;
+SELECT ('\x' || repeat('a', 32))::bytea;
+SELECT ('\x' || repeat('f', 32))::bytea;
+SELECT ('\x' || repeat('g', 33))::bytea;
+SELECT ('\x' || repeat('~', 33))::bytea;
+
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
 SELECT E'\\x De Ad Be Ef '::bytea;
-- 
2.39.5 (Apple Git-154)

#44John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#43)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:

(Separately, now I'm wondering if we can do the same for
vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
would allow getting rid of )

I think so. I doubt there's any performance advantage, but it could be
nice for code cleanup. (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.")

Yes right, sorry. And it seems good to do such cleanup first, since it
doesn't make sense to rename something that is about to be deleted.

I think I disagree on this one. We're not broadcasting arbitrary bytes for
every vector element, we're broadcasting a pattern of bytes that happens to
be wider than the element size. I would expect this to be a relatively
common use-case.

That's probably true. I'm still worried that the hack for working
around compiler pickiness (while nice enough in its current form)
might break at some point and require awareness of compiler versions.

Hmm, for this case, we can sidestep the maintainability questions
entirely by instead using the new interleave functions to build the
masks:

vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())

This generates code identical to v12's on Arm and is not bad on x86.
What do you think of the attached?
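
[Editorial illustration, not part of the patch: the proposal above builds the nibble masks by interleaving a zero vector with a broadcast vector instead of broadcasting a 32-bit constant. A minimal scalar model of vector8_interleave_low on 16-byte lanes shows why the two are equivalent; interleave_low_model is a hypothetical name for this sketch.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar model of vector8_interleave_low on 16-byte vectors: byte 2k of
 * the result comes from v1[k] and byte 2k+1 from v2[k], for k = 0..7.
 * Interleaving all-zero with all-0x0f yields bytes 00 0f 00 0f ...,
 * which read as little-endian 32-bit words is the 0x0f000f00 pattern
 * that vector8_broadcast_u32(0x0f000f00) produced in v12.
 */
static void
interleave_low_model(const uint8_t *v1, const uint8_t *v2, uint8_t *out)
{
	for (int k = 0; k < 8; k++)
	{
		out[2 * k] = v1[k];
		out[2 * k + 1] = v2[k];
	}
}
```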

While looking around again, it looks like the "msk" variable isn't a
mask like the name implies to me. Not sure of a better name because I'm not
sure what it represents aside from a temp variable.

+#elif defined(USE_NEON)
+ switch (i)
+ {
+ case 4:
+ return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+ case 8:
+ return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+ default:
+ pg_unreachable();
+ return vector8_broadcast(0);
+ }

This is just a compiler hint, so if the input is not handled I think
it will return the wrong answer rather than alerting the developer, so
we probably want "Assert(false)" here.
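
[Editorial illustration, not part of the patch: a scalar sketch of the failure mode described above. shift_right_model is a hypothetical helper, not PostgreSQL code; pg_unreachable() is only an optimizer hint, so an unhandled shift count would silently invoke undefined behavior, whereas an assertion fails loudly in assert-enabled builds.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the switch-on-literal pattern with a loud guard: handled
 * shift counts return the shifted value, and an unhandled count trips
 * the assertion in debug builds instead of returning garbage.
 */
static uint32_t
shift_right_model(uint32_t v, int i)
{
	switch (i)
	{
		case 4:
			return v >> 4;
		case 8:
			return v >> 8;
		default:
			assert(!"unhandled shift count");	/* alarm in debug builds */
			return 0;
	}
}
```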

Other than that, the pack/unpack functions could use some
documentation about which parameter is low/high.

--
John Naylor
Amazon Web Services

Attachments:

interleave-mask.patch.txt (text/plain; charset=US-ASCII)
diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 87126a003b3..db734481a60 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -315,6 +315,8 @@ static inline bool
 hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
 {
 	Vector8		sub;
+	Vector8		mask_hi = vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f));
+	Vector8		mask_lo = vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero());
 	Vector8		msk;
 	bool		ret;
 
@@ -332,9 +334,9 @@ hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
 	*dst = vector8_issub(src, sub);
 	ret = !vector8_has_ge(*dst, 0x10);
 
-	msk = vector8_and(*dst, vector8_broadcast_u32(0x0f000f00));
+	msk = vector8_and(*dst, mask_hi);
 	msk = vector8_shift_right(msk, 8);
-	*dst = vector8_and(*dst, vector8_broadcast_u32(0x000f000f));
+	*dst = vector8_and(*dst, mask_lo);
 	*dst = vector8_shift_left(*dst, 4);
 	*dst = vector8_or(*dst, msk);
 	return ret;
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 0261179e9e7..e316233b7aa 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -101,6 +101,17 @@ static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 #endif
 
+/* return a zeroed register */
+static inline Vector8
+vector8_zero()
+{
+#if defined(USE_SSE2)
+	return _mm_setzero_si128();
+#elif defined(USE_NEON)
+	return vmovq_n_u8(0);
+#endif
+}
+
 /*
  * Load a chunk of memory into the given vector.
  */
@@ -170,24 +181,6 @@ vector32_broadcast(const uint32 c)
 }
 #endif							/* ! USE_NO_SIMD */
 
-/*
- * Some compilers are picky about casts to the same underlying type, and others
- * are picky about implicit conversions with vector types.  This function does
- * the same thing as vector32_broadcast(), but it returns a Vector8 and is
- * carefully crafted to avoid compiler indigestion.
- */
-#ifndef USE_NO_SIMD
-static inline Vector8
-vector8_broadcast_u32(const uint32 c)
-{
-#ifdef USE_SSE2
-	return vector32_broadcast(c);
-#elif defined(USE_NEON)
-	return (Vector8) vector32_broadcast(c);
-#endif
-}
-#endif							/* ! USE_NO_SIMD */
-
 /*
  * Return true if any elements in the vector are equal to the given scalar.
  */
@@ -577,8 +570,9 @@ vector8_interleave_high(const Vector8 v1, const Vector8 v2)
 static inline Vector8
 vector8_pack_16(const Vector8 v1, const Vector8 v2)
 {
-	Vector32	mask PG_USED_FOR_ASSERTS_ONLY = vector32_broadcast(0xff00ff00);
+	Vector8		mask PG_USED_FOR_ASSERTS_ONLY;
 
+	mask = vector8_interleave_low(vector8_zero(), vector8_broadcast(0xff));
 	Assert(!vector8_has_ge(vector8_and(v1, mask), 1));
 	Assert(!vector8_has_ge(vector8_and(v2, mask), 1));
 #ifdef USE_SSE2
#45Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#44)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Mon, Sep 29, 2025 at 03:45:27PM +0700, John Naylor wrote:

On Fri, Sep 26, 2025 at 1:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Thu, Sep 25, 2025 at 09:16:35PM +0700, John Naylor wrote:

(Separately, now I'm wondering if we can do the same for
vector8_has_le since _mm_min_epu8 and vminvq_u8 both exist, and that
would allow getting rid of )

I think so. I doubt there's any performance advantage, but it could be
nice for code cleanup. (I'm assuming you meant to say vector8_ssub
(renamed to vector8_ussub() in the patch) after "getting rid of.")

Yes right, sorry. And it seems good to do such cleanup first, since it
doesn't make sense to rename something that is about to be deleted.

Will do. I'll plan on committing the other patch [0] soon.

Hmm, for this case, we can sidestep the maintainability questions
entirely by instead using the new interleave functions to build the
masks:

vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f))
vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero())

This generates code identical to v12's on Arm and is not bad on x86.
What do you think of the attached?

WFM

While looking around again, it looks like the "msk" variable isn't a
mask like the name implies to me. Not sure of a better name because I'm not
sure what it represents aside from a temp variable.

Renamed to "tmp".

+#elif defined(USE_NEON)
+ switch (i)
+ {
+ case 4:
+ return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+ case 8:
+ return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+ default:
+ pg_unreachable();
+ return vector8_broadcast(0);
+ }

This is just a compiler hint, so if the input is not handled I think
it will return the wrong answer rather than alerting the developer, so
we probably want "Assert(false)" here.

Fixed.

Other than that, the pack/unpack functions could use some
documentation about which parameter is low/high.

Added.

[0]: /messages/by-id/attachment/182185/v3-0001-Optimize-vector8_has_le-on-AArch64.patch

--
nathan

Attachments:

v13-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch (text/plain; charset=us-ascii)
From 5affab26c2c73a2ae916ba3907a675705698ef83 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 12 Sep 2025 15:51:55 -0500
Subject: [PATCH v13 1/1] Optimize hex_encode() and hex_decode() using SIMD.

---
 src/backend/utils/adt/encode.c        | 137 ++++++++++++++-
 src/include/port/simd.h               | 237 +++++++++++++++++++++++++-
 src/test/regress/expected/strings.out |  58 +++++++
 src/test/regress/sql/strings.sql      |  16 ++
 4 files changed, 438 insertions(+), 10 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 9a9c7e8da99..eaccc87753f 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,55 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This works by splitting the high and low nibbles of each byte into
+	 * separate vectors, adding the vectors to a mask that converts the
+	 * nibbles to their equivalent ASCII bytes, and interleaving those bytes
+	 * back together to form the final hex-encoded string.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector8_shift_right(hi, 4);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
@@ -213,8 +263,8 @@ hex_decode(const char *src, size_t len, char *dst)
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
-uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+static inline uint64
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +304,85 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+/*
+ * This helper converts each byte to its binary-equivalent nibble by
+ * subtraction and combines them to form the return bytes (separated by zero
+ * bytes).  Returns false if any input bytes are outside the expected ranges of
+ * ASCII values.  Otherwise, returns true.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
+{
+	Vector8		sub;
+	Vector8		mask_hi = vector8_interleave_low(vector8_zero(), vector8_broadcast(0x0f));
+	Vector8		mask_lo = vector8_interleave_low(vector8_broadcast(0x0f), vector8_zero());
+	Vector8		tmp;
+	bool		ret;
+
+	tmp = vector8_gt(vector8_broadcast('9' + 1), src);
+	sub = vector8_and(tmp, vector8_broadcast('0'));
+
+	tmp = vector8_gt(src, vector8_broadcast('A' - 1));
+	tmp = vector8_and(tmp, vector8_broadcast('A' - 10));
+	sub = vector8_add(sub, tmp);
+
+	tmp = vector8_gt(src, vector8_broadcast('a' - 1));
+	tmp = vector8_and(tmp, vector8_broadcast('a' - 'A'));
+	sub = vector8_add(sub, tmp);
+
+	*dst = vector8_issub(src, sub);
+	ret = !vector8_has_ge(*dst, 0x10);
+
+	tmp = vector8_and(*dst, mask_hi);
+	tmp = vector8_shift_right(tmp, 8);
+	*dst = vector8_and(*dst, mask_lo);
+	*dst = vector8_shift_left(*dst, 4);
+	*dst = vector8_or(*dst, tmp);
+	return ret;
+}
+#endif							/* ! USE_NO_SIMD */
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_NO_SIMD
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) * 2 - 1);
+	uint64		i;
+	bool		success = true;
+
+	/*
+	 * We must process 2 vectors at a time since the output will be half the
+	 * length of the input.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8) * 2)
+	{
+		Vector8		srcv;
+		Vector8		dstv1;
+		Vector8		dstv2;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+		success &= hex_decode_simd_helper(srcv, &dstv1);
+
+		vector8_load(&srcv, (const uint8 *) &src[i + sizeof(Vector8)]);
+		success &= hex_decode_simd_helper(srcv, &dstv2);
+
+		vector8_store((uint8 *) &dst[i / 2], vector8_pack_16(dstv1, dstv2));
+	}
+
+	/*
+	 * If something didn't look right in the vector path, try again in the
+	 * scalar path so that we can handle it correctly.
+	 */
+	if (!success)
+		i = 0;
+
+	return i / 2 + hex_decode_safe_scalar(src + i, len - i, dst + i / 2, escontext);
+#endif
+}
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 97c5f353022..ee6888458db 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -86,7 +86,7 @@ static inline uint32 vector8_highbit_mask(const Vector8 v);
 static inline Vector8 vector8_or(const Vector8 v1, const Vector8 v2);
 #ifndef USE_NO_SIMD
 static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
-static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_ussub(const Vector8 v1, const Vector8 v2);
 #endif
 
 /*
@@ -101,6 +101,19 @@ static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
 static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 #endif
 
+/*
+ * Return a zeroed register.
+ */
+static inline Vector8
+vector8_zero()
+{
+#if defined(USE_SSE2)
+	return _mm_setzero_si128();
+#elif defined(USE_NEON)
+	return vmovq_n_u8(0);
+#endif
+}
+
 /*
  * Load a chunk of memory into the given vector.
  */
@@ -128,6 +141,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -257,13 +285,32 @@ vector8_has_le(const Vector8 v, const uint8 c)
 	 * NUL bytes.  This approach is a workaround for the lack of unsigned
 	 * comparison instructions on some architectures.
 	 */
-	result = vector8_has_zero(vector8_ssub(v, vector8_broadcast(c)));
+	result = vector8_has_zero(vector8_ussub(v, vector8_broadcast(c)));
 #endif
 
 	Assert(assert_result == result);
 	return result;
 }
 
+/*
+ * Returns true if any elements in the vector are greater than or equal to the
+ * given scalar.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+	Vector8		umax = _mm_max_epu8(v, vector8_broadcast(c));
+	Vector8		cmpe = _mm_cmpeq_epi8(umax, v);
+
+	return vector8_is_highbit_set(cmpe);
+#elif defined(USE_NEON)
+	return vmaxvq_u8(v) >= c;
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return true if the high bit of any element is set
  */
@@ -358,15 +405,65 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return the result of subtracting the respective elements of the input
- * vectors using saturation (i.e., if the operation would yield a value less
- * than zero, zero is returned instead).  For more information on saturation
- * arithmetic, see https://en.wikipedia.org/wiki/Saturation_arithmetic
+ * vectors using signed saturation (i.e., if the operation would yield a value
+ * less than -128, -128 is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
  */
 #ifndef USE_NO_SIMD
 static inline Vector8
-vector8_ssub(const Vector8 v1, const Vector8 v2)
+vector8_issub(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_subs_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return (Vector8) vqsubq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of subtracting the respective elements of the input
+ * vectors using unsigned saturation (i.e., if the operation would yield a
+ * value less than zero, zero is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_ussub(const Vector8 v1, const Vector8 v2)
 {
 #ifdef USE_SSE2
 	return _mm_subs_epu8(v1, v2);
@@ -404,6 +501,23 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -419,4 +533,115 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves (e.g., for SSE2, bits 0-63) of given
+ * vectors.  Bytes 0, 2, 4, etc. use v1, and bytes 1, 3, 5, etc. use v2.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves (e.g., for SSE2, bits 64-127) of given
+ * vectors.  Bytes 0, 2, 4, etc. use v1, and bytes 1, 3, 5, etc. use v2.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  The first half of the return vector (e.g., for SSE2, bits 0-63)
+ * uses v1, and the second half (e.g., for SSE2, bits 64-127) uses v2.
+ *
+ * NB: The upper 8-bits of each 16-bit element must be zeros, else this will
+ * produce different results on different architectures.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_pack_16(const Vector8 v1, const Vector8 v2)
+{
+	Vector8		mask PG_USED_FOR_ASSERTS_ONLY;
+
+	mask = vector8_interleave_low(vector8_zero(), vector8_broadcast(0xff));
+	Assert(!vector8_has_ge(vector8_and(v1, mask), 1));
+	Assert(!vector8_has_ge(vector8_and(v2, mask), 1));
+#ifdef USE_SSE2
+	return _mm_packus_epi16(v1, v2);
+#elif defined(USE_NEON)
+	return vuzp1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift left of each 32-bit element in the vector by "i" bits.
+ *
+ * XXX AArch64 requires an integer literal, so we have to list all expected
+ * values of "i" from all callers in a switch statement.  If you add a new
+ * caller, be sure your expected values of "i" are handled.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_left(const Vector8 v1, int i)
+{
+#ifdef USE_SSE2
+	return _mm_slli_epi32(v1, i);
+#elif defined(USE_NEON)
+	switch (i)
+	{
+		case 4:
+			return (Vector8) vshlq_n_u32((Vector32) v1, 4);
+		default:
+			Assert(false);
+			return vector8_broadcast(0);
+	}
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each 32-bit element in the vector by "i" bits.
+ *
+ * XXX AArch64 requires an integer literal, so we have to list all expected
+ * values of "i" from all callers in a switch statement.  If you add a new
+ * caller, be sure your expected values of "i" are handled.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_right(const Vector8 v1, int i)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, i);
+#elif defined(USE_NEON)
+	switch (i)
+	{
+		case 4:
+			return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+		case 8:
+			return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+		default:
+			Assert(false);
+			return vector8_broadcast(0);
+	}
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 691e475bce3..b9dc08d5f61 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -260,6 +260,64 @@ SELECT reverse('\xabcd'::bytea);
  \xcdab
 (1 row)
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+ bytea 
+-------
+ \x
+(1 row)
+
+SELECT ('\x' || repeat('!', 32))::bytea;
+ERROR:  invalid hexadecimal digit: "!"
+SELECT ('\x' || repeat('/', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "/"
+SELECT ('\x' || repeat('0', 34))::bytea;
+                bytea                 
+--------------------------------------
+ \x0000000000000000000000000000000000
+(1 row)
+
+SELECT ('\x' || repeat('9', 32))::bytea;
+               bytea                
+------------------------------------
+ \x99999999999999999999999999999999
+(1 row)
+
+SELECT ('\x' || repeat(':', 32))::bytea;
+ERROR:  invalid hexadecimal digit: ":"
+SELECT ('\x' || repeat('@', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "@"
+SELECT ('\x' || repeat('A', 34))::bytea;
+                bytea                 
+--------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('F', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('G', 32))::bytea;
+ERROR:  invalid hexadecimal digit: "G"
+SELECT ('\x' || repeat('`', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "`"
+SELECT ('\x' || repeat('a', 34))::bytea;
+                bytea                 
+--------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('f', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('g', 32))::bytea;
+ERROR:  invalid hexadecimal digit: "g"
+SELECT ('\x' || repeat('~', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "~"
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
       bytea       
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index c05f3413699..a2a91523404 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -82,6 +82,22 @@ SELECT reverse(''::bytea);
 SELECT reverse('\xaa'::bytea);
 SELECT reverse('\xabcd'::bytea);
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+SELECT ('\x' || repeat('!', 32))::bytea;
+SELECT ('\x' || repeat('/', 34))::bytea;
+SELECT ('\x' || repeat('0', 34))::bytea;
+SELECT ('\x' || repeat('9', 32))::bytea;
+SELECT ('\x' || repeat(':', 32))::bytea;
+SELECT ('\x' || repeat('@', 34))::bytea;
+SELECT ('\x' || repeat('A', 34))::bytea;
+SELECT ('\x' || repeat('F', 32))::bytea;
+SELECT ('\x' || repeat('G', 32))::bytea;
+SELECT ('\x' || repeat('`', 34))::bytea;
+SELECT ('\x' || repeat('a', 34))::bytea;
+SELECT ('\x' || repeat('f', 32))::bytea;
+SELECT ('\x' || repeat('g', 32))::bytea;
+SELECT ('\x' || repeat('~', 34))::bytea;
+
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
 SELECT E'\\x De Ad Be Ef '::bytea;
-- 
2.39.5 (Apple Git-154)

#46 John Naylor
johncnaylorls@gmail.com
In reply to: Nathan Bossart (#45)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Oct 3, 2025 at 12:33 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

[v13]

LGTM, but I went back and checked whether vector8_zero() actually does
anything different from vector8_broadcast(0), and in fact it doesn't on
compilers we support for either x86 or Arm. I pulled the former out
from older work, but it seems irrelevant now. Pardon the noise.

--
John Naylor
Amazon Web Services

#47 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#46)
1 attachment(s)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Oct 03, 2025 at 02:36:47PM +0700, John Naylor wrote:

LGTM, but I went back and checked whether vector8_zero() actually does
anything different from vector8_broadcast(0), and in fact it doesn't on
compilers we support for either x86 or Arm. I pulled the former out
from older work, but it seems irrelevant now. Pardon the noise.

Here is what I have staged for commit.

--
nathan

Attachments:

v14-0001-Optimize-hex_encode-and-hex_decode-using-SIMD.patch (text/plain; charset=us-ascii)
From 90d4bab0c777e38cf7c69c0b33c1a2e583c0c249 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 3 Oct 2025 15:25:04 -0500
Subject: [PATCH v14 1/1] Optimize hex_encode() and hex_decode() using SIMD.

This commit adds new implementations of hex_encode() and
hex_decode() that use routines (both existing and new) from simd.h.
Testing indicates these new implementations are much faster for
larger inputs.  For smaller inputs and for hex_decode() inputs
containing bytes outside of [0-9A-Fa-f], we use the existing scalar
versions.  Since we are using simd.h routines, these optimizations
apply to both x86-64 and AArch64.

Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Chiranmoy Bhattacharya <chiranmoy.bhattacharya@fujitsu.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/aLhVWTRy0QPbW2tl%40nathan
---
 src/backend/utils/adt/encode.c        | 137 ++++++++++++++++-
 src/include/port/simd.h               | 211 ++++++++++++++++++++++++++
 src/test/regress/expected/strings.out |  58 +++++++
 src/test/regress/sql/strings.sql      |  16 ++
 4 files changed, 418 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/adt/encode.c b/src/backend/utils/adt/encode.c
index 9a9c7e8da99..aabe9913eee 100644
--- a/src/backend/utils/adt/encode.c
+++ b/src/backend/utils/adt/encode.c
@@ -16,6 +16,7 @@
 #include <ctype.h>
 
 #include "mb/pg_wchar.h"
+#include "port/simd.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "varatt.h"
@@ -177,8 +178,8 @@ static const int8 hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 };
 
-uint64
-hex_encode(const char *src, size_t len, char *dst)
+static inline uint64
+hex_encode_scalar(const char *src, size_t len, char *dst)
 {
 	const char *end = src + len;
 
@@ -193,6 +194,55 @@ hex_encode(const char *src, size_t len, char *dst)
 	return (uint64) len * 2;
 }
 
+uint64
+hex_encode(const char *src, size_t len, char *dst)
+{
+#ifdef USE_NO_SIMD
+	return hex_encode_scalar(src, len, dst);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) - 1);
+	uint64		i;
+
+	/*
+	 * This splits the high and low nibbles of each byte into separate
+	 * vectors, adds the vectors to a mask that converts the nibbles to their
+	 * equivalent ASCII bytes, and interleaves those bytes back together to
+	 * form the final hex-encoded string.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		Vector8		srcv;
+		Vector8		lo;
+		Vector8		hi;
+		Vector8		mask;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+
+		lo = vector8_and(srcv, vector8_broadcast(0x0f));
+		mask = vector8_gt(lo, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		lo = vector8_add(lo, mask);
+
+		hi = vector8_and(srcv, vector8_broadcast(0xf0));
+		hi = vector8_shift_right(hi, 4);
+		mask = vector8_gt(hi, vector8_broadcast(0x9));
+		mask = vector8_and(mask, vector8_broadcast('a' - '0' - 10));
+		mask = vector8_add(mask, vector8_broadcast('0'));
+		hi = vector8_add(hi, mask);
+
+		vector8_store((uint8 *) &dst[i * 2],
+					  vector8_interleave_low(hi, lo));
+		vector8_store((uint8 *) &dst[i * 2 + sizeof(Vector8)],
+					  vector8_interleave_high(hi, lo));
+	}
+
+	(void) hex_encode_scalar(src + i, len - i, dst + i * 2);
+
+	return (uint64) len * 2;
+#endif
+}
+
 static inline bool
 get_hex(const char *cp, char *out)
 {
@@ -213,8 +263,8 @@ hex_decode(const char *src, size_t len, char *dst)
 	return hex_decode_safe(src, len, dst, NULL);
 }
 
-uint64
-hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+static inline uint64
+hex_decode_safe_scalar(const char *src, size_t len, char *dst, Node *escontext)
 {
 	const char *s,
 			   *srcend;
@@ -254,6 +304,85 @@ hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
 	return p - dst;
 }
 
+/*
+ * This helper converts each byte to its binary-equivalent nibble by
+ * subtraction and combines them to form the return bytes (separated by zero
+ * bytes).  Returns false if any input bytes are outside the expected ranges of
+ * ASCII values.  Otherwise, returns true.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+hex_decode_simd_helper(const Vector8 src, Vector8 *dst)
+{
+	Vector8		sub;
+	Vector8		mask_hi = vector8_interleave_low(vector8_broadcast(0), vector8_broadcast(0x0f));
+	Vector8		mask_lo = vector8_interleave_low(vector8_broadcast(0x0f), vector8_broadcast(0));
+	Vector8		tmp;
+	bool		ret;
+
+	tmp = vector8_gt(vector8_broadcast('9' + 1), src);
+	sub = vector8_and(tmp, vector8_broadcast('0'));
+
+	tmp = vector8_gt(src, vector8_broadcast('A' - 1));
+	tmp = vector8_and(tmp, vector8_broadcast('A' - 10));
+	sub = vector8_add(sub, tmp);
+
+	tmp = vector8_gt(src, vector8_broadcast('a' - 1));
+	tmp = vector8_and(tmp, vector8_broadcast('a' - 'A'));
+	sub = vector8_add(sub, tmp);
+
+	*dst = vector8_issub(src, sub);
+	ret = !vector8_has_ge(*dst, 0x10);
+
+	tmp = vector8_and(*dst, mask_hi);
+	tmp = vector8_shift_right(tmp, 8);
+	*dst = vector8_and(*dst, mask_lo);
+	*dst = vector8_shift_left(*dst, 4);
+	*dst = vector8_or(*dst, tmp);
+	return ret;
+}
+#endif							/* ! USE_NO_SIMD */
+
+uint64
+hex_decode_safe(const char *src, size_t len, char *dst, Node *escontext)
+{
+#ifdef USE_NO_SIMD
+	return hex_decode_safe_scalar(src, len, dst, escontext);
+#else
+	const uint64 tail_idx = len & ~(sizeof(Vector8) * 2 - 1);
+	uint64		i;
+	bool		success = true;
+
+	/*
+	 * We must process 2 vectors at a time since the output will be half the
+	 * length of the input.
+	 */
+	for (i = 0; i < tail_idx; i += sizeof(Vector8) * 2)
+	{
+		Vector8		srcv;
+		Vector8		dstv1;
+		Vector8		dstv2;
+
+		vector8_load(&srcv, (const uint8 *) &src[i]);
+		success &= hex_decode_simd_helper(srcv, &dstv1);
+
+		vector8_load(&srcv, (const uint8 *) &src[i + sizeof(Vector8)]);
+		success &= hex_decode_simd_helper(srcv, &dstv2);
+
+		vector8_store((uint8 *) &dst[i / 2], vector8_pack_16(dstv1, dstv2));
+	}
+
+	/*
+	 * If something didn't look right in the vector path, try again in the
+	 * scalar path so that we can handle it correctly.
+	 */
+	if (!success)
+		i = 0;
+
+	return i / 2 + hex_decode_safe_scalar(src + i, len - i, dst + i / 2, escontext);
+#endif
+}
+
 static uint64
 hex_enc_len(const char *src, size_t srclen)
 {
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 5f5737707a8..b0165b45861 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -127,6 +127,21 @@ vector32_load(Vector32 *v, const uint32 *s)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Store a vector into the given memory address.
+ */
+#ifndef USE_NO_SIMD
+static inline void
+vector8_store(uint8 *s, Vector8 v)
+{
+#ifdef USE_SSE2
+	_mm_storeu_si128((Vector8 *) s, v);
+#elif defined(USE_NEON)
+	vst1q_u8(s, v);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Create a vector with all elements set to the same value.
  */
@@ -265,6 +280,25 @@ vector8_has_le(const Vector8 v, const uint8 c)
 	return result;
 }
 
+/*
+ * Returns true if any elements in the vector are greater than or equal to the
+ * given scalar.
+ */
+#ifndef USE_NO_SIMD
+static inline bool
+vector8_has_ge(const Vector8 v, const uint8 c)
+{
+#ifdef USE_SSE2
+	Vector8		umax = _mm_max_epu8(v, vector8_broadcast(c));
+	Vector8		cmpe = vector8_eq(umax, v);
+
+	return vector8_is_highbit_set(cmpe);
+#elif defined(USE_NEON)
+	return vmaxvq_u8(v) >= c;
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return true if the high bit of any element is set
  */
@@ -359,6 +393,55 @@ vector32_or(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return the bitwise AND of the inputs.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_and(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_and_si128(v1, v2);
+#elif defined(USE_NEON)
+	return vandq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of adding the respective elements of the input vectors.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_add(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_add_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vaddq_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Return the result of subtracting the respective elements of the input
+ * vectors using signed saturation (i.e., if the operation would yield a value
+ * less than -128, -128 is returned instead).  For more information on
+ * saturation arithmetic, see
+ * https://en.wikipedia.org/wiki/Saturation_arithmetic
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_issub(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_subs_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return (Vector8) vqsubq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Return a vector with all bits set in each lane where the corresponding
  * lanes in the inputs are equal.
@@ -387,6 +470,23 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Return a vector with all bits set for each lane of v1 that is greater than
+ * the corresponding lane of v2.  NB: The comparison treats the elements as
+ * signed.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_gt(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpgt_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vcgtq_s8((int8x16_t) v1, (int8x16_t) v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 /*
  * Given two vectors, return a vector with the minimum element of each.
  */
@@ -402,4 +502,115 @@ vector8_min(const Vector8 v1, const Vector8 v2)
 }
 #endif							/* ! USE_NO_SIMD */
 
+/*
+ * Interleave elements of low halves (e.g., for SSE2, bits 0-63) of given
+ * vectors.  Bytes 0, 2, 4, etc. use v1, and bytes 1, 3, 5, etc. use v2.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_low(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpacklo_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Interleave elements of high halves (e.g., for SSE2, bits 64-127) of given
+ * vectors.  Bytes 0, 2, 4, etc. use v1, and bytes 1, 3, 5, etc. use v2.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_interleave_high(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+	return _mm_unpackhi_epi8(v1, v2);
+#elif defined(USE_NEON)
+	return vzip2q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Pack 16-bit elements in the given vectors into a single vector of 8-bit
+ * elements.  The first half of the return vector (e.g., for SSE2, bits 0-63)
+ * uses v1, and the second half (e.g., for SSE2, bits 64-127) uses v2.
+ *
+ * NB: The upper 8-bits of each 16-bit element must be zeros, else this will
+ * produce different results on different architectures.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_pack_16(const Vector8 v1, const Vector8 v2)
+{
+	Vector8		mask PG_USED_FOR_ASSERTS_ONLY;
+
+	mask = vector8_interleave_low(vector8_broadcast(0), vector8_broadcast(0xff));
+	Assert(!vector8_has_ge(vector8_and(v1, mask), 1));
+	Assert(!vector8_has_ge(vector8_and(v2, mask), 1));
+#ifdef USE_SSE2
+	return _mm_packus_epi16(v1, v2);
+#elif defined(USE_NEON)
+	return vuzp1q_u8(v1, v2);
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift left of each 32-bit element in the vector by "i" bits.
+ *
+ * XXX AArch64 requires an integer literal, so we have to list all expected
+ * values of "i" from all callers in a switch statement.  If you add a new
+ * caller, be sure your expected values of "i" are handled.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_left(const Vector8 v1, int i)
+{
+#ifdef USE_SSE2
+	return _mm_slli_epi32(v1, i);
+#elif defined(USE_NEON)
+	switch (i)
+	{
+		case 4:
+			return (Vector8) vshlq_n_u32((Vector32) v1, 4);
+		default:
+			Assert(false);
+			return vector8_broadcast(0);
+	}
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
+/*
+ * Unsigned shift right of each 32-bit element in the vector by "i" bits.
+ *
+ * XXX AArch64 requires an integer literal, so we have to list all expected
+ * values of "i" from all callers in a switch statement.  If you add a new
+ * caller, be sure your expected values of "i" are handled.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_shift_right(const Vector8 v1, int i)
+{
+#ifdef USE_SSE2
+	return _mm_srli_epi32(v1, i);
+#elif defined(USE_NEON)
+	switch (i)
+	{
+		case 4:
+			return (Vector8) vshrq_n_u32((Vector32) v1, 4);
+		case 8:
+			return (Vector8) vshrq_n_u32((Vector32) v1, 8);
+		default:
+			Assert(false);
+			return vector8_broadcast(0);
+	}
+#endif
+}
+#endif							/* ! USE_NO_SIMD */
+
 #endif							/* SIMD_H */
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 691e475bce3..b9dc08d5f61 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -260,6 +260,64 @@ SELECT reverse('\xabcd'::bytea);
  \xcdab
 (1 row)
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+ bytea 
+-------
+ \x
+(1 row)
+
+SELECT ('\x' || repeat('!', 32))::bytea;
+ERROR:  invalid hexadecimal digit: "!"
+SELECT ('\x' || repeat('/', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "/"
+SELECT ('\x' || repeat('0', 34))::bytea;
+                bytea                 
+--------------------------------------
+ \x0000000000000000000000000000000000
+(1 row)
+
+SELECT ('\x' || repeat('9', 32))::bytea;
+               bytea                
+------------------------------------
+ \x99999999999999999999999999999999
+(1 row)
+
+SELECT ('\x' || repeat(':', 32))::bytea;
+ERROR:  invalid hexadecimal digit: ":"
+SELECT ('\x' || repeat('@', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "@"
+SELECT ('\x' || repeat('A', 34))::bytea;
+                bytea                 
+--------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('F', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('G', 32))::bytea;
+ERROR:  invalid hexadecimal digit: "G"
+SELECT ('\x' || repeat('`', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "`"
+SELECT ('\x' || repeat('a', 34))::bytea;
+                bytea                 
+--------------------------------------
+ \xaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+(1 row)
+
+SELECT ('\x' || repeat('f', 32))::bytea;
+               bytea                
+------------------------------------
+ \xffffffffffffffffffffffffffffffff
+(1 row)
+
+SELECT ('\x' || repeat('g', 32))::bytea;
+ERROR:  invalid hexadecimal digit: "g"
+SELECT ('\x' || repeat('~', 34))::bytea;
+ERROR:  invalid hexadecimal digit: "~"
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
       bytea       
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index c05f3413699..a2a91523404 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -82,6 +82,22 @@ SELECT reverse(''::bytea);
 SELECT reverse('\xaa'::bytea);
 SELECT reverse('\xabcd'::bytea);
 
+SELECT ('\x' || repeat(' ', 32))::bytea;
+SELECT ('\x' || repeat('!', 32))::bytea;
+SELECT ('\x' || repeat('/', 34))::bytea;
+SELECT ('\x' || repeat('0', 34))::bytea;
+SELECT ('\x' || repeat('9', 32))::bytea;
+SELECT ('\x' || repeat(':', 32))::bytea;
+SELECT ('\x' || repeat('@', 34))::bytea;
+SELECT ('\x' || repeat('A', 34))::bytea;
+SELECT ('\x' || repeat('F', 32))::bytea;
+SELECT ('\x' || repeat('G', 32))::bytea;
+SELECT ('\x' || repeat('`', 34))::bytea;
+SELECT ('\x' || repeat('a', 34))::bytea;
+SELECT ('\x' || repeat('f', 32))::bytea;
+SELECT ('\x' || repeat('g', 32))::bytea;
+SELECT ('\x' || repeat('~', 34))::bytea;
+
 SET bytea_output TO escape;
 SELECT E'\\xDeAdBeEf'::bytea;
 SELECT E'\\x De Ad Be Ef '::bytea;
-- 
2.39.5 (Apple Git-154)

#48 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#47)
Re: [PATCH] Hex-coding optimizations using SVE on ARM.

On Fri, Oct 03, 2025 at 03:33:21PM -0500, Nathan Bossart wrote:

Here is what I have staged for commit.

Committed. That seems like a good stopping point for this work, so I have
marked the associated commitfest entry as "Committed."

--
nathan