Introducing PgVA aka PostgresVectorAcceleration: using SIMD vector instructions, starting with hex_encode
INTENTION
Inspired by the effort to integrate JIT for executor acceleration, I think selected simple algorithms working on array-oriented data can be drastically accelerated by using SIMD instructions on modern hardware.
I want to introduce this style of programming with the example of hex_encode, because it:
- operates on arrays (bytea)
- is a simple algorithm
- partially limits performance in some situations (e.g. pg_dump)
IMPLEMENTATION GUIDELINES
The main goal is to accelerate common cases on the most common hardware by exploiting all the resources the hardware delivers.
The following guidelines led me to a first implementation:
- restrict to 64-bit architectures
These are the dominant server architectures and have the necessary data formats, the corresponding registers and the matching instructions
- start with Intel x86-64 SIMD instructions:
This is by far the most widely used platform, available both for development and in practical use
- don't restrict the concept to Intel x86-64 only, so that people with more experience on other architectures can later jump in and implement comparable algorithms
- fall back to the established implementation in Postgres in unsuitable cases or on user request (GUC); a condensed sketch of this fallback pattern follows this list
- implementation of leaf functions/procedures in assembly language
These consist mostly of a central loop without calling subroutines or doing additional branching
- coding for maximum hardware usage instead of elegant programming
Once tested, the simple algorithm works as advertised and replaces most execution parts of the standard implementation in C
- isolated footprint by integrating it only in the specific subroutine (here hex_encode)
This ensures that the requirements for fast execution are met (e.g. buffer sizes) and no repeated checks are needed like in a library use case.
- trying to keep both vector execution ports always doing useful work by avoiding waits for latencies
- trying to access memory in a linear fashion (reading from the input buffer, writing to the output buffer) to avoid internal cache problems
- focus optimization on the most advanced SIMD instruction set: AVX512
This provides the most advanced instructions and quite a lot of large registers to aid in latency avoidance
- if possible, provide fallback implementations for older SIMD standards (e.g. AVX2 or SSE2)
This is useful on many older server and client processors, but because these have too few registers, latency hiding and fully loaded execution queues cannot be completely achieved.
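
To make the fallback guideline concrete, here is a condensed sketch of the pattern the attached patch uses in hex_encode (debugging leftovers stripped; the names are those of the patch, the exact threshold is still open). The SIMD routine processes as much of the input as it can, and the established byte-wise loop finishes the rest:

uint64
hex_encode(const char *src, size_t len, char *dst)
{
    const char *end = src + len;
    size_t      n_done = 0;

#if defined(__x86_64__) || defined(_M_AMD64)
    /* fast path only for inputs large enough to amortize the setup cost */
    if (len >= 512)
        n_done = hex_encode_fast(dst, src, len,
                                 valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE]);
#endif

    /* the established scalar loop handles the tail (or everything) */
    src += n_done;
    dst += n_done * 2;
    while (src < end)
    {
        *dst++ = hextbl[(*src >> 4) & 0xF];
        *dst++ = hextbl[*src & 0xF];
        src++;
    }
    return (uint64) len * 2;
}

Because hex_encode_fast returns the number of source bytes it actually processed, a return value of 0 (unsupported CPU, disabled algorithm, or input too short) simply means the whole input goes through the unchanged standard path.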
IMPLEMENTATION DETAILS
- The loops implementing the algorithm are written in NASM assembler:
NASM is actively maintained, has many output formats, follows the Intel style, has all current instructions implemented and is fast.
- The loops are mostly independent of the operating system, so every OS for which NASM has an object output format is supported:
This includes Linux and Windows as the most important ones.
- The algorithms use advanced techniques (constant and temporary registers) to avoid most unnecessary memory accesses:
The assembly implementation gives you full control over the registers (unlike intrinsics).
- Multiple dependency chains work interleaved to minimize latencies:
The code of the chains is often interspersed and uses almost all available registers.
- Some instructions (moves, zeroing) are executed outside the processor execution ports:
These don't consume execution cycles on a port, but their latency still has to be considered.
- Some vector instructions (multiply-add, for example) have a latency of 5 cycles:
This means that after the instruction is issued, the processor has to wait 5 cycles until the result can be used in the same dependency chain. To avoid this and keep all vector execution ports (p0 and p5) busy, you need about 9 other instructions in between doing work on other streams of the algorithm, to maximize hardware usage and overall performance.
- All loops are implemented as separate C-callable functions (according to the OS calling convention):
They are all leaf functions, calling no other subroutines.
- The decision which implementation is chosen is made on the caller side by a special dispatcher routine (see the C rendering of the dispatcher after this list):
The caller handles the architectural capabilities (available instruction sets) and knows the required work: there is often a minimum amount of work required for calling a provided implementation efficiently.
- Loops should run at least 2-4 times to compensate for the initialization overhead:
This implies a certain minimum amount of work, depending on the specific SIMD implementation.
- The loops terminate after detecting an error (e.g. wrong input data) and return the successfully completed amount of work:
The standard linear implementation then takes over with its already established error handling.
- The loops work optimally with some extra output buffer space at the end, so that the last round may overshoot:
Nonetheless the correct amount of work is returned to the caller, and one vector width of output buffer following the real result is zeroed out (currently disabled!).
- The loop may preload some data beyond the end of the input buffer, but it ensures that the following page boundary is never crossed, to avoid any access violation:
This does no harm to the memory system because the output buffer has supplemental space at the end, but it could be changed to leave the tail handling to the standard implementation if deemed unsupportable (as it is for now).
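
To illustrate the dispatcher, here is the selection logic of hex_encode_fast rendered in C (in the patch it is assembly, driven by the HEX_ENC_MIN_SRC_LEN_ARR and HEX_ENC_IMPL_ROUTINE_ARR tables; the table values below are taken from the patch, the C form itself is only a sketch):

/* the assembly loops from hex_x86_64.asm, all C-callable leaf functions */
extern size_t hex_encode_sse2(char *dst, const char *src, size_t srclen);
extern size_t hex_encode_ssse3(char *dst, const char *src, size_t srclen);
extern size_t hex_encode_avx2(char *dst, const char *src, size_t srclen);
extern size_t hex_encode_avx512bw(char *dst, const char *src, size_t srclen);

typedef size_t (*hex_encode_impl) (char *dst, const char *src, size_t srclen);

/* index 0 means "no accelerated implementation available" */
static const size_t hex_enc_min_src_len[] = {0, 128, 512, 512, 1024};
static const hex_encode_impl hex_enc_impl_routine[] = {
    NULL, hex_encode_sse2, hex_encode_ssse3, hex_encode_avx2, hex_encode_avx512bw
};

static size_t
hex_encode_dispatch(char *dst, const char *src, size_t srclen, int64 max_impl_id)
{
    int64       id = max_impl_id;   /* highest impl_id this CPU supports */

    /* step down until the minimum-length requirement of an implementation is met */
    while (id > 0 && srclen < hex_enc_min_src_len[id])
        id--;

    if (id == 0)
        return 0;                   /* nothing suitable: caller uses the scalar loop */

    return hex_enc_impl_routine[id] (dst, src, srclen);
}

The real hex_encode_fast additionally saves and restores the registers required by the calling convention, but its decision logic is essentially this table walk.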
PERFORMANCE
First Level: execution units (focused on AVX512)
Every modern processor has at least 2 vector execution units (p1 and p5 on Intel) which execute different sets of instructions in a pipelined fashion. Some simple classes of instructions (logical, arithmetic) can be executed on both ports. The result of a short operation is available in the next cycle for another instruction; such instructions together form a dependency chain.
Longer operations deliver their results only after some more clock cycles, so the latency rises from at least 1 to higher numbers.
This constellation implies that a single dependency chain can never exhaust the full processor capabilities. To fight these latencies, multiple interleaved dependency chains should be used.
Instructions with long latencies (e.g. memory accesses) should be issued long before their results are used.
In most cases only the two vector execution ports are the ultimate bottleneck, since the processor can execute memory reads, memory writes, scalar instructions and branches on other specialized units, or avoid them entirely (register zeroing).
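
A scalar C analogy of the latency problem (this is not from the patch, only an illustration of the principle): with a single accumulator every addition must wait for the previous result, while several independent accumulators give the out-of-order core enough parallel work to fill the otherwise wasted cycles.

#include <stddef.h>
#include <stdint.h>

/* one dependency chain: each add has to wait for the previous one */
uint64_t
sum_single_chain(const uint64_t *v, size_t n)
{
    uint64_t    s = 0;

    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* four interleaved chains: the independent adds can overlap in the pipeline */
uint64_t
sum_four_chains(const uint64_t *v, size_t n)
{
    uint64_t    s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t      i;

    for (i = 0; i + 4 <= n; i += 4)
    {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}

The assembly loops apply the same idea to vector registers: the result lines RL0 to RL3 in hex_encode_sse2 are exactly such interleaved chains.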
The hex_encode algorithm executes 5 instructions (uops, to be precise) on p5, 3 on p1 (or any suitable port), plus 1 load and 2 store uops per 64-byte vector.
With 5 uops on the bottleneck port, one vector takes 5 cycles, so a 2.5 GHz processor (chosen for simplicity) processes 0.5 billion vectors per second, which gives 64 bytes * 0.5 billion = 32 GB of source processed per second as the maximum.
In normal database units this is a really HUGE number (and it uses only ONE core!).
But in this case twice the amount of data, compared to the source, has to be written to memory as the result (64 GB/s), which exceeds the capabilities of normal desktop processors.
As another example take the checksum algorithm, which is only read-intensive and whose bottleneck is 3 uops on p5 per vector.
On a 3 GHz processor, checksum can therefore process 64 GB per second and core.
It will be interesting to check the performance levels of the upcoming XEON generation (Sapphire Rapids in 2022), which will have much increased memory bandwidth (8 channels of DDR5, up to 12 on AMD Genoa), will offer some special models with HBM2 memory stacks, and has 2 execution units for stores to match the read capacity of likewise 2 instructions per cycle.
Second Level: alignment, caches and memory
Older processor generations took a VERY big performance hit on split accesses to memory, which occur for vector data that is not naturally aligned.
Given future developments I would even consider 128-byte or 256-byte alignment, since 1024- or 2048-bit vectors may become possible (already specified in the ARM architecture).
On the level of caches one must consider "cache thrashing", when the accesses to a cache exceed its maximum associativity. In some algorithms (very highly parallel checksum calculation with a copy function) you could overload a single cache line address with too many parallel accesses. In these cases you can start the algorithm (on the fixed-size blocks) slightly delayed, so that some chains of the algorithm access vector n and others vector n+1, interleaved in the execution loop.
Memory should be accessed in natural order to maximize the use of processor cache prefetching.
All accesses should be optimized to use registers where possible: the long latencies of memory accesses and of some initial instructions can be hidden by issuing those instructions early and using their results only much later.
The memory latencies lead to data preloading, where the data for the next round of the loop is loaded at the first possible moment when target registers are available. This is crucial for fighting latencies in many algorithms.
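
The preloading technique in a C-style sketch (the patch does this with the xmm8-xmm11 registers in hex_encode_sse2; block_t, load_block and transform are hypothetical names used only for this illustration):

#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t w[8]; } block_t;      /* hypothetical 64-byte work unit */

static inline block_t
load_block(const block_t *p)
{
    return *p;                  /* stands in for the vector load in the real code */
}

static inline block_t
transform(block_t b)
{
    for (int i = 0; i < 8; i++)
        b.w[i] ^= 0xFF;         /* placeholder for the real computation */
    return b;
}

void
process_blocks(const block_t *in, block_t *out, size_t nblocks)
{
    block_t     cur;

    if (nblocks == 0)
        return;

    cur = load_block(&in[0]);   /* prologue: first load issued before the loop */
    for (size_t i = 0; i < nblocks; i++)
    {
        block_t     next;

        /* issue the load for round i+1 long before its result is needed */
        if (i + 1 < nblocks)
            next = load_block(&in[i + 1]);

        out[i] = transform(cur);    /* the load latency is hidden behind this work */

        if (i + 1 < nblocks)
            cur = next;
    }
}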
Third Level: Internal data structures
Vector operations work best with array oriented structures (here a bytea datatype or a shared buffer block for checksum calculation).
Gathering individual scalar data (32/64-bit scalars) into vectors is much slower and really stresses the memory subsystem.
This implies a focus on "struct of arrays" rather than "array of structures", which seems difficult in Postgres due to its established structure and long heritage.
By exploring the code more deeply (than my knowledge reaches so far) it should be easy to identify many more places with simple algorithms working on array structures.
Fourth Level: Vertical integration
The base of most algorithms is loading the data into registers, doing some algorithmic calculations and writing the result out.
Subsequent steps are coded in another layer (e.g. copying to storage, trimming the data for output etc.). This often requires reading the data again and doing some other transformations.
Vertical integration combines some simple steps for better memory utilization.
As an example, think of pg_dump dumping a huge amount of bytea data (not uncommon in real applications). Most of these data are in toast tables, often uncompressed due to their inherent structure. The dump must read the toast pages into memory, decompose the page, hex-encode the content, put the result in an output buffer and trigger the I/O. By integrating all these steps into one, big performance improvements can be achieved (but naturally not here in my first implementation!).
Fifth Level: Pooling
Some algorithms are so fast that they need to work on multiple data streams at once to fully utilize a processor core. One example is checksum calculation.
To saturate the processor capabilities with large vectors you have to do the checksum on multiple pages in parallel (e.g. 2, 4 or 8).
This occurs often in real life (loading shared buffers into memory, flushing shared buffers to disk, precaching the shared buffers etc.).
Some pooling (collecting up to 16 shared buffer blocks in a pool) allows fast checksumming of blocks that are currently processed in a serial fashion.
This requires some adaptation in a few isolated parts of the Postgres code base and turns a serial procedure into parallel processing of objects treated in the same fashion at (nearly) the same time.
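
As a purely hypothetical interface sketch of such a pool (none of these names exist in Postgres or in the attached patch; only the usual Postgres uint16/uint32 typedefs are assumed), the point is simply that callers hand over several blocks before a single vectorized call checksums them together:

#define CHECKSUM_POOL_SIZE 16           /* hypothetical: up to 16 shared buffer blocks */

typedef struct ChecksumPool
{
    int         nblocks;
    const char *blocks[CHECKSUM_POOL_SIZE];     /* pointers to 8 kB block images */
    uint32      blknos[CHECKSUM_POOL_SIZE];     /* block numbers (they enter the checksum) */
    uint16      checksums[CHECKSUM_POOL_SIZE];
} ChecksumPool;

/* hypothetical vector routine checksumming several blocks in parallel */
extern void checksum_blocks_simd(const char **blocks, const uint32 *blknos,
                                 uint16 *checksums, int nblocks);

/* collect blocks; once the pool is full, checksum them all in one call */
static void
checksum_pool_add(ChecksumPool *pool, const char *block, uint32 blkno)
{
    pool->blocks[pool->nblocks] = block;
    pool->blknos[pool->nblocks] = blkno;
    pool->nblocks++;

    if (pool->nblocks == CHECKSUM_POOL_SIZE)
    {
        checksum_blocks_simd(pool->blocks, pool->blknos,
                             pool->checksums, pool->nblocks);
        pool->nblocks = 0;
    }
}

A partially filled pool would of course need a final flush call, and the few call sites would have to be adapted accordingly.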
BENCHMARKS:
I have included a little benchmark program. It is not very sophisticated or fancy, but it allows estimating the performance of commonly used processors.
It requires nasm to be installed/downloaded (on Linux or Windows).
It executes the hexdump algorithm one million times on the binary of nasm (version 2.15.05, the current one).
The benchmark simply runs (for about 1000 seconds); the user has to measure the time himself.
The binary of nasm (used as the benchmark source data) is included as the source data in
HEX_BENCH_DATA_1300KB.asm
(please adjust the location where you downloaded nasm.exe on Windows).
The binary has a size of 1356 KB on Windows and 1718 KB on Linux.
The commands to build the binary are (also found in hex_bench.asm)
on Windows:
:: commands to build on Windows (nasm and golink in the path)
nasm -f WIN64 -g hex_bench.asm -l hex_bench.lis
nasm -f WIN64 -g hex_x86_64.asm -l hex_x86_64.lis
nasm -f WIN64 -g HEX_BENCH_DATA_1300KB.asm
golink /console hex_bench.obj hex_x86_64.obj HEX_BENCH_DATA_1300KB.obj
GoLink is a small utility linker for Windows; it has to be downloaded separately.
on Linux:
# commands to build on LINUX
nasm -f elf64 -g hex_bench.asm -l hex_bench.lis
nasm -f elf64 -g hex_x86_64.asm -l hex_x86_64.lis
nasm -f elf64 -g HEX_BENCH_DATA_1300KB.asm
ld -o hex_bench hex_bench.o hex_x86_64.o HEX_BENCH_DATA_1300KB.o
The selected hex_encode routine is hardcoded to hex_encode_avx512bw (on processors not supporting AVX512, please choose another implementation by changing the comments in hex_bench.asm).
The best result I could achieve was roughly 95 seconds for 1 million dumps of 1718 KB on a Tiger Lake laptop using AVX512. That corresponds to a source hex-dumping rate of about 18 GB/s on a single core!
In another run with Postgres, the time to hexdump about half a million tuples with a bytea column, yielding about 6 GB of output, was reduced from about 68 seconds to 60 seconds, which clearly shows the Postgres overhead of executing the COPY command on such a data set.
SQL> COPY my_table (col_bytearray) TO 'N:/ZZ_SAV/my_hexdump.sql';
(This was on a customer's dataset, not reproduced here.)
POSTGRES INTEGRATION (HELP NEEDED)
The architecture-dependent introduction of vector routines requires some integration effort into Postgres.
I have designed a concept for easy integration and extensibility, but some concrete steps need support from others due to my restricted knowledge of the whole system.
(For now this global configuration sits at the top of encode.c, but it certainly must be moved to a more adequate place for initialization.)
The main concept tries to match the CPU capabilities with the requirements of a certain implementation. This is not only for hex_encode but for an arbitrary number of algorithms implemented in an accelerated version (here SIMD vectors, but others may be possible too).
We have a global array called valid_impl_id_arr indicating which implementations are capable of running on the current CPU.
An implementer defines an algorithm and gets an invariant ID (here ALGORITHM_ID_HEX_ENCODE, which should be kept in a global header).
These IDs are valid for all architectures, even if no accelerated version exists yet.
In internal arrays (see hex_x86_64.asm) all possible implementations are stored along with their requirements (CPU features, minimum length etc.).
In the initialization phase of the running executable (backend, or frontend in the future) the current CPU capabilities are checked once and the maximum valid implementation index is stored in the globally visible valid_impl_id_arr.
The highest requirements have the highest index, so the capabilities are checked in decreasing index order.
For example (hex_encode): we have 4 implementations, but on an AVX2-only machine valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE] is only set to 3, because the requirements of AVX512BW are not met. Index zero always indicates that the algorithm has no valid implementation or that the CPU has insufficient capabilities.
To disable an algorithm from being accelerated at all, masking by an algorithm_disable_mask is provided. It is normally all zero but can be set to disable certain algorithms by ORing in (1 << ALGORITHM_ID_xxx). This emergency disablement should be kept in a GUC and applied only at image initialization time.
The CPU capabilities are determined by cpuid instructions (on x86-64) and defined in cpu_capabilities_x86_64.asm.
But this scheme is not restricted to the Intel ISA. Other hardware architectures (most probably ARM, POWER or RISC-V) are identified by different CPU_IS_ARCH_xxx constants (numbers from 1-7), and implementers determine the specific CPU capabilities in their own fashion, which may be totally different from the Intel ISA.
So every CPU gets its cpu_capabilities_unmasked value as a unique int64.
This value is normally copied 1:1 to the global cpu_capabilities, but for testing or in an emergency it is masked by a configuration mask simulating a certain CPU. This allows a developer to test the implementations for lower-class CPUs without needing the specific hardware.
This cpu_capabilities_mask defaults to -1 (all bits set) and should also be derived from a GUC.
For up to 63 algorithms we therefore need 2 int64 GUC values to selectively disable certain parts of the accelerated implementations.
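
Put together, the initialization I have in mind looks roughly like this (the variables and apply_cpu_capabilities are taken from the attached patch; pgva_initialize and the idea of backing the two masks by GUCs are placeholders, which is exactly where I need help):

/* globals, currently sitting at the top of encode.c in the patch */
static int64 cpu_capabilities_unmasked = -1;
static int64 cpu_capabilities = -1;
static int64 cpu_capabilities_mask = -1;    /* GUC: simulate a lesser CPU, default all bits set */
static int64 algorithm_disable_mask = 0;    /* GUC: bit n disables ALGORITHM_ID n */

static int64 valid_impl_id_arr[64];

extern size_t apply_cpu_capabilities(int64 *capabilities, int64 *impl_id_arr, int64 mask);

/* placeholder name: should run once per backend, wherever such initialization belongs */
static void
pgva_initialize(void)
{
#if defined(__x86_64__) || defined(_M_AMD64)
    apply_cpu_capabilities(&cpu_capabilities_unmasked,
                           valid_impl_id_arr,
                           algorithm_disable_mask);
    cpu_capabilities = cpu_capabilities_unmasked & cpu_capabilities_mask;
#else
    cpu_capabilities = 0;       /* no accelerated implementations on this architecture yet */
#endif
}

After this has run, an accelerated call site only needs to read valid_impl_id_arr[ALGORITHM_ID_xxx] for its algorithm and pass that value to its dispatcher.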
Help is greatly appreciated to wire up this concept with GUCs and to put the globals and their initialization in the right place.
TOOL CHAIN (HELP NEEDED)
On x86-64 I use nasm (the Netwide Assembler) because it is well maintained, fast, instruction-complete and covers multiple object formats.
The assembler routines should work on most x86-64 operating systems, but for the moment only elf64 and WIN64 output formats are supported.
The standard calling convention is followed, mostly in the Linux style; on Windows the parameters are moved around accordingly. The same assembler source code can be used on both platforms.
(The Windows binary and an rpm repository can be downloaded from the NASM website.)
I have updated the Makefile to include the nasm command and the nasm flags, but I need help to derive these from configure.
I also have no knowledge of other operating systems (macOS etc.).
The calling conventions can easily be adapted if they differ, but somebody else should jump in for testing.
If absolutely needed, nasm allows cross-assembling for a different platform, so the objects could be provided in a library for these cases.
For Windows the nasm support must be integrated into the generation of the *.vcxproj for Visual Studio.
I found the VSNASM project on GitHub, which explains how to integrate NASM into Visual Studio.
https://github.com/ShiftMediaProject/VSNASM
But I really need help from an expert to integrate it into the Perl build process.
My internal development on Windows uses manual assembling/linking so far.
I would much appreciate it if someone else could jump in with a patch for the configure integration and another patch for the .vcxproj integration.
OUTLOOK
Once the toolchain and the global Postgres integration are done (these are totally new to me), this kind of vector (or perhaps matrix) acceleration is quite easy.
By identifying simple algorithms and using some architecture knowledge of the chosen platform, a new implementation is easily coded and debugged because the complexity is often limited (the performance optimization may be a challenge).
The integration into Postgres remains quite local and is not very invasive.
The acceleration of the specific algorithm is really huge, even though it shifts the focus to other bottlenecks in the current code base. It makes the base algorithms almost disappear in CPU usage and extends the scale to the dimensions of terabytes.
The whole architecture is thereby not limited to the Intel ISA (even if this is certainly the most common real-world use case) and can easily be adapted to other hardware architectures.
I have some other algorithms already in the pipeline, foremost hex_decode (which must still be debugged and checked for error handling); during implementation I also stumbled over base64_encode/decode, whose implementations are coded as well.
I only want to start with a first project (hex_encode/hex_decode) targeting PG15, if possible and approved by the community. Then I'll try to polish/debug/document the whole project to bring it to a committable state.
There is much room for other implementations (checksum verification/setting, aggregation, numeric datatype, merging, generate_series, integer and floating point output …) which could be addressed later on.
Due to my different background (not really a C hacker) I need some help from experts in specific areas. In return, for coding Intel vector assembly for the project I can provide help with tips and revisions.
I have CCed some people of the project who offered help or were already involved in this coding area.
Thank you all very much for your patience with this new project!
Hans Buschmann
Attachments:
0001_hex_encode.patch (application/octet-stream)
diff --git a/postgresql-15devel_orig/src/Makefile.global.in b/postgresql-15devel/src/Makefile.global.in
index 05c54b2..ea5c785 100644
--- a/postgresql-15devel_orig/src/Makefile.global.in
+++ b/postgresql-15devel/src/Makefile.global.in
@@ -270,6 +270,10 @@ LLVM_CPPFLAGS = @LLVM_CPPFLAGS@
LLVM_CFLAGS = @LLVM_CFLAGS@
LLVM_CXXFLAGS = @LLVM_CXXFLAGS@
+# TODO should be adapted to configure
+NASM = nasm
+NASMFLAGS = elf64
+
# Kind-of compilers
BISON = @BISON@
@@ -782,6 +786,10 @@ endif
%.bz2: %
$(BZIP2) -c $< >$@
+%.o: %.asm
+ $(NASM) -f $(NASMFLAGS) -g -o $@ $<
+
+
# Direct builds of foo.c -> foo are disabled to avoid generating
# *.dSYM junk on Macs. All builds should normally go through the
# foo.c -> foo.o -> foo steps. This also ensures that dependency
diff --git a/postgresql-15devel_orig/src/backend/utils/adt/Makefile b/postgresql-15devel/src/backend/utils/adt/Makefile
index 41b486b..fa74e69 100644
--- a/postgresql-15devel_orig/src/backend/utils/adt/Makefile
+++ b/postgresql-15devel/src/backend/utils/adt/Makefile
@@ -25,6 +25,7 @@ OBJS = \
bool.o \
cash.o \
char.o \
+ cpu_capabilities_x86_64.o \
cryptohashfuncs.o \
date.o \
datetime.o \
@@ -42,6 +43,7 @@ OBJS = \
geo_ops.o \
geo_selfuncs.o \
geo_spgist.o \
+ hex_x86_64.o \
inet_cidr_ntop.o \
inet_net_pton.o \
int.o \
diff --git a/postgresql-15devel/src/backend/utils/adt/cpu_capabilities_x86_64.asm b/postgresql-15devel/src/backend/utils/adt/cpu_capabilities_x86_64.asm
new file mode 100644
index 0000000..bcb7db3
--- /dev/null
+++ b/postgresql-15devel/src/backend/utils/adt/cpu_capabilities_x86_64.asm
@@ -0,0 +1,630 @@
+%ifdef __NASM_MAJOR__
+%ifdef COMPILE_C_STYLE_COMMENTS
+/*-------------------------------------------------------------------------
+ *
+ * cpu_capabilities_x86_64.asm
+ * Assembler routines for fetching the cpu_capabilities in a convenient int64
+ * and selecting the maximum possible implementation for all valid algorithms
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/cpu_capabilities_x86_64.asm
+ *
+ *-------------------------------------------------------------------------
+ */
+%endif
+
+
+
+
+
+
+; cpu_capabilities_x86_64.asm
+; Assembler routines for fetching the cpu_capabilities in a convenient int64
+; and selecting the maximum possible implementation for all valid algorithms on Intel X64
+
+
+; nasm -f WIN64 -g cpu_capabilities_x86_64.asm -l cpu_capabilities_x86_64.lis
+
+; golink /console hexdump.obj cpu_capabilities_x86_64.obj hex_x86_64.obj base64_x86_64.obj /files
+
+; Linux register order: %rdi, %rsi, %rdx, %rcx, %r8 and %r9
+; Windows register order: rcx, rdx, r8, r9
+
+; Windows non volatile registers: rbx,rbp,rdi,rsi,rsp, r12,r13,r14,r15 and xmm6-xmm15
+; Linux non volatile registers: rbx,rbp, rsp, r12,r13,r14,r15
+
+; nasm -f elf64 -g cpu_capabilities_x86_64.asm -l cpu_capabilities_x86_64_elf64.lis
+
+
+%ifidn __OUTPUT_FORMAT__, win64
+%define __WIN__ 1
+%elifidn __OUTPUT_FORMAT__, elf64
+%define __ELF__ 1
+%endif
+
+
+global apply_cpu_capabilities
+
+global get_instr_info
+
+;;global get_highest_impl_id
+
+
+
+default rel
+
+section .rdata align=64
+
+; these are the id defines for different algorithms implemented or planned
+; every implementation know its own IMPL_ID, which should never change
+;
+
+%define ALGORITHM_ID_HEX_ENCODE 0
+%define ALGORITHM_ID_HEX_DECODE 1
+%define ALGORITHM_ID_BASE64_ENCODE 2
+%define ALGORITHM_ID_BASE64_DECODE 3
+%define ALGORITHM_ID_CECKSUM 4
+%define ALGORITHM_ID_CECKSUM_COPY 5
+
+
+
+
+%define CPU_IS_ARCH_X86_64 1
+%define CPU_HAS_SSE2 8
+%define CPU_HAS_SSE3 9
+%define CPU_HAS_SSSE3 10
+%define CPU_HAS_SSE4_1 11
+%define CPU_HAS_SSE4_2 12
+%define CPU_HAS_AVX 13
+%define CPU_HAS_F16C 14
+%define CPU_HAS_AVX2 15
+%define CPU_HAS_AVX512_F 16
+%define CPU_HAS_AVX512_VL 17
+%define CPU_HAS_AVX512_DQ 18
+%define CPU_HAS_AVX512_BW 19
+%define CPU_HAS_AVX512_IFMA 20
+%define CPU_HAS_AVX512_VBMI 21
+%define CPU_HAS_AVX512_VBMI2 22
+%define CPU_HAS_AVX512_VNNI 23
+%define CPU_HAS_AVX512_BITALG 24
+%define CPU_HAS_AVX512_VPOPCNTDQ 25
+%define CPU_HAS_AVX512_VP2INTERSECT 26
+%define CPU_HAS_AVX512_FP16 27
+%define CPU_HAS_AMX_TILE 28
+%define CPU_HAS_AMX_BF16 29
+%define CPU_HAS_AMX_INT8 31
+
+
+REQUIREMENTS_ARR:
+HEX_ENC_CPU_REQUIREMENTS_ARR:
+ dq 0
+ dq CPU_IS_ARCH_X86_64 + (1<<CPU_HAS_SSE2)
+ dq CPU_IS_ARCH_X86_64 + (1<<CPU_HAS_SSSE3)
+ dq CPU_IS_ARCH_X86_64 + (1<<CPU_HAS_AVX2)
+ dq CPU_IS_ARCH_X86_64 + (1<<CPU_HAS_AVX512_BW)
+
+
+
+HEX_DEC_CPU_REQUIREMENTS_ARR:
+ dq 0
+ dq CPU_IS_ARCH_X86_64 + (1<<CPU_HAS_SSE2)
+ dq CPU_IS_ARCH_X86_64 + (1<<CPU_HAS_AVX2)
+ dq CPU_IS_ARCH_X86_64 + (1<<CPU_HAS_AVX512_BW)
+
+
+CPU_REQUIREMENTS_OFFS_ARR_X86_64:
+ dq HEX_ENC_CPU_REQUIREMENTS_ARR - REQUIREMENTS_ARR ; ALGORITHM_ID_HEX_ENCODE 0
+ dq HEX_DEC_CPU_REQUIREMENTS_ARR - REQUIREMENTS_ARR ; ALGORITHM_ID_HEX_DECODE 1
+ dq 0 ; ALGORITHM_ID_BASE64_ENCODE 2
+ dq 0 ; ALGORITHM_ID_BASE64_DECODE 3
+ dq 0 ; ALGORITHM_ID_CECKSUM 4
+ dq 0 ; ALGORITHM_ID_CECKSUM_COPY 5
+
+ times 60 dq 0
+ dq -1
+
+
+
+
+VALID_IMPL_ID_ARR_X86_64:
+; TODO due to error reduced to 3 from 4 !!!
+ dq 3 ; ALGORITHM_ID_HEX_ENCODE 0
+ dq 3 ; ALGORITHM_ID_HEX_DECODE 1
+ dq 0 ; ALGORITHM_ID_BASE64_ENCODE 2
+ dq 0 ; ALGORITHM_ID_BASE64_DECODE 3
+ dq 0 ; ALGORITHM_ID_CECKSUM 4
+ dq 0 ; ALGORITHM_ID_CECKSUM_COPY 5
+
+ times 60 dq -1
+
+
+
+
+
+
+section .text align=32
+
+
+%use smartalign
+
+ ALIGNMODE p6
+
+
+%ifdef USE_LOCAL_IMPL_ID_ARR_LOOKUP
+;----------------------------------------------------------------------------------------------
+
+; get_highest_impl_id accepts the ALGORITHM_ID of the requested algorithm
+; and returns the highest impl_id valid on the current architecture.
+; A return value of zero indicates no valid implementation
+
+get_highest_impl_id:
+
+ sub rsp,0x28
+
+%ifdef __WIN__
+; mov rcx,rcx ; WIN parameter 1 ALGORITHM_ID
+%else
+ mov rcx,rdi ; LINUX parameter 1 ALGORITHM_ID
+%endif
+
+ mov rax,[8*rcx+VALID_IMPL_ID_ARR_X86_64]
+
+ add rsp,0x28
+
+ ret
+
+%endif
+
+
+;----------------------------------------------------------------------------------------------
+
+; apply_cpu_capabilities fetches the cpu capabilities and compares them to the cpu_requirements
+; of all Algorithms.
+; It uses a local array of algorithm_impl_id, which contains the maximum impl_id for
+; this specific algorithm. A zero indicates the algorithm has not been implemented yet,
+; a -1 terminates the list
+; according to the cpu capabilities a public impl_id_array (passed by address parameter)
+; is filled with the maximum supported impl_ids of the algorithms
+; this public array is initialized to all zero, so the support of a specific algorithm must
+; be enabled by this routine
+; on other CPU architectures (only 1 architecture is active in any program) similar routines
+; can fill the public impl_id_arr according to their implementations.
+; For reference the cpu_capabilities are set to a public int8 variable passed by address.
+; there is a unique value for every interesting CPU architecture.
+;
+; To mask one or more algorithms (bug hunting, debugging), a mask can be passed as third parameter.
+; every Algorithm_ID has the corresponding bit in the mask (bit 0..x corresponds to algorithm 0..x).
+; when the bit for a certain algorithm is set, the public impl_id_arr is not set and left at its
+; default of zero.
+
+%define STACK_ADJ 0x28+4*8
+
+apply_cpu_capabilities:
+ ; parameter 1 address cpu_capabilities (int8)
+ ; parameter 2 address valid_impl_arr (array of int8)
+ ; parameter 3 bitmask algorithm disable (int8 by value)
+
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+0*8],rdi
+ mov [rsp+1*8],rsi
+ mov [rsp+2*8],r9
+ mov [rsp+3*8],r15
+
+%ifdef __WIN__
+ mov rdi,rcx ; parameter 1 address cpu_capabilities
+
+ mov rsi,rdx ; parameter 2 address valid_impl_arr
+
+ mov rdx,r8 ; parameter 3 bitmask algorithm disable
+
+%endif
+
+ mov r9,[rdi] ; load current cpu_capabilities
+ cmp r9,0
+ jns .capabilities_set
+
+ call get_instr_info
+ mov [rdi],rax
+ mov r9,rax ; r9 = current capabilities
+
+.capabilities_set:
+
+ mov r8,1 ; r8 bitmask for current algorithm
+ lea rdi,[VALID_IMPL_ID_ARR_X86_64] ; rdi address of local impl_id_arr
+ xor r15,r15 ; r15 current offset into algorithm arrays
+
+.loop_algorithm:
+ mov rcx,[rdi+8*r15] ; rcx = max index for current algorithm
+ cmp rcx,0
+ jl .end_loop_algorithm ; current index = -1 -> goto end
+ je .skip_algorithm
+
+ lea rax,[CPU_REQUIREMENTS_OFFS_ARR_X86_64] ; start offset of requirement_arr of current algorithm
+ mov rax,[rax+8*r15] ; start offset of requirement_arr of current algorithm
+ lea r10,[REQUIREMENTS_ARR] ; r10 pointer to requirement_arr for current algorithm
+ add r10,rax
+
+.check_requirements:
+ mov rax,r9 ; rax temp for current capabilities
+ and rax,[8*rcx+r10]
+ cmp rax,[8*rcx+r10]
+ je .max_index_found
+ sub rcx,1
+ jnz .check_requirements
+
+.max_index_found:
+ mov rax,r8 ; rax temp for bitmask of current algorithm
+ test rax,rdx
+ jnz .skip_algorithm
+ mov [rsi+8*r15],rcx
+
+
+.skip_algorithm:
+        add r8,r8                               ; shift bitmask of current algorithm one bit to the left
+ add r15,1
+ jmp .loop_algorithm
+
+.end_loop_algorithm:
+
+.return:
+
+ mov rdi,[rsp+0*8]
+ mov rsi,[rsp+1*8]
+ mov r9 ,[rsp+2*8]
+ mov r15,[rsp+3*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+;----------------------------------------------------------------------------------------------
+
+ ; CPUID Input EAX=01h
+ ; Feature Information Returned in the ECX Register (according to Intel Instruction Manual)
+ ;ECX bit
+ ;-> 0 SSE3 Streaming SIMD Extensions 3
+ ; 1 PCLMULQDQ
+ ; 2 DTES64 64-bit DS Area.
+ ; 3 MONITOR MONITOR/MWAIT.
+ ; 4 DS-CPL CPL Qualified Debug Store.
+ ; 5 VMX Virtual Machine Extensions.
+ ; 6 SMX Safer Mode Extensions.
+ ; 7 EIST Enhanced Intel SpeedStep® technology.
+ ; 8 TM2 Thermal Monitor 2.
+ ;-> 9 SSSE3
+ ; 10 CNXT-ID L1 Context ID.
+ ; 11 SDBG
+ ; 12 FMA
+ ; 13 CMPXCHG16B
+ ; 14 xTPR Update Control
+ ; 15 PDCM Perfmon and Debug Capability.
+ ; 16 Reserved
+ ; 17 PCID Process-context identifiers.
+ ; 18 DCA
+ ;-> 19 SSE4_1
+ ;-> 20 SSE4_2
+ ; 21 x2APIC
+ ; 22 MOVBE
+ ; 23 POPCNT
+ ; 24 TSC-Deadline
+ ; 25 AESNI
+ ; 26 XSAVE
+ ; 27 OSXSAVE
+ ;-> 28 AVX
+ ;-> 29 F16C
+ ; 30 RDRAND
+ ; 31 Not Used
+
+ ; CPUID Input EAX=01h
+ ; Feature Information Returned in the EDX Register (according to Intel Instruction Manual)
+ ; EDX bit
+ ; 0 FPU Floating Point Unit On-Chip.
+ ; 1 VME Virtual 8086 Mode Enhancements.
+ ; 2 DE Debugging Extensions.
+ ; 3 PSE Page Size Extension.
+ ; 4 TSC Time Stamp Counter.
+ ; 5 MSR Model Specific Registers RDMSR and WRMSR Instructions.
+ ; 6 PAE Physical Address Extension.
+ ; 7 MCE Machine Check Exception.
+ ; 8 CX8 CMPXCHG8B Instruction. Th
+ ; 9 APIC APIC On-Chip.
+ ; 10 Reserved
+ ; 11 SEP SYSENTER and SYSEXIT Instructions.
+ ; 12 MTRR Memory Type Range Registers
+ ; 13 PGE Page Global Bit.
+ ; 14 MCA Machine Check Architecture.
+ ; 15 CMOV Conditional Move Instructions.
+ ; 16 PAT Page Attribute Table.
+ ; 17 PSE-36 36-Bit Page Size Extension.
+ ; 18 PSN Processor Serial Number.
+ ; 19 CLFSH CLFLUSH Instruction.
+ ; 20 Reserved
+ ; 21 DS Debug Store.
+ ; 22 ACPI Thermal Monitor and Software Controlled Clock Facilities.
+ ; 23 MMX Intel MMX Technology.
+ ; 24 FXSR FXSAVE and FXRSTOR Instructions.
+ ; 25 SSE SSE.
+ ;-> 26 SSE2 SSE2.
+ ; 27 SS Self Snoop.
+ ; 28 HTT Max APIC IDs reserved field is Valid.
+ ; 29 TM Thermal Monitor.
+ ; 30 Reserved
+ ; 31 PBE Pending Break Enable.
+ ;
+
+ ; CPUID Input EAX=07H
+ ; Feature Information returned in the EAX-EDX Registers (according to Intel Instruction Set extension Manual)
+
+; EBX bits
+ ; EBX Bit00: FSGSBASE. Supports RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE if 1.
+ ; EBX Bit01: IA32_TSC_ADJUST MSR is supported if 1.
+ ; EBX Bit02: SGX
+ ; EBX Bit03: BMI1
+ ; EBX Bit04: HLE
+ ;-> EBX Bit05: Intel® AVX2
+ ; EBX Bit06: FDP_EXCPTN_ONLY. x87 FPU Data Pointer updated only on x87 exceptions if 1.
+ ; EBX Bit07: SMEP. Supports Supervisor Mode Execution Protection if 1.
+ ; EBX Bit08: BMI2
+ ; EBX Bit09: Supports Enhanced REP MOVSB/STOSB if 1.
+ ; EBX Bit10: INVPCID
+ ; EBX Bit11: RTM
+ ; EBX Bit12: RDT-M. Supports Intel® Resource Director Technology (Intel® RDT) Monitoring capability if 1.
+ ; EBX Bit13: Deprecates FPU CS and FPU DS values if 1.
+ ; EBX Bit14: Intel® Memory Protection Extensions
+ ; EBX Bit15: RDT-A. Supports Intel® Resource Director Technology (Intel® RDT) Allocation capability if 1.
+ ;-> EBX Bit16: AVX512F
+ ;-> EBX Bit17: AVX512DQ
+ ; EBX Bit18: RDSEED
+ ; EBX Bit19: ADX
+ ; EBX Bit20: SMAP
+ ;-> EBX Bit21: AVX512_IFMA
+ ; EBX Bit22: Reserved
+ ; EBX Bit23: CLFLUSHOPT
+ ; EBX Bit24: CLWB
+ ; EBX Bit25: Intel Processor Trace
+ ; EBX Bit26: AVX512PF (Intel® Xeon Phi™ only.)
+ ; EBX Bit27: AVX512ER (Intel® Xeon Phi™ only.)
+ ; EBX Bit28: AVX512CD
+ ; EBX Bit29: SHA
+ ;-> EBX Bit30: AVX512BW
+ ;-> EBX Bit31: AVX512VL
+
+
+; ECX bits
+ ; ECX Bit00: PREFETCHWT1 (Intel® Xeon Phi™ only.)
+ ;-> ECX Bit01: AVX512_VBMI
+ ; ECX Bit02: UMIP. Supports user-mode instruction prevention if 1.
+ ; ECX Bit03: PKU. Supports protection keys for user-mode pages if 1.
+ ; ECX Bit04: OSPKE. If 1, OS has set CR4.PKE to enable protection keys (and the RDPKRU/WRPKRU instructions).
+ ; ECX Bit05: WAITPKG
+ ;-> ECX Bit06: AVX512_VBMI2
+ ; ECX Bit07: CET_SS. Supports CET shadow stack features if 1.
+ ; ECX Bit08: GFNI
+ ; ECX Bit09: VAES
+ ; ECX Bit10: VPCLMULQDQ
+ ;-> ECX Bit11: AVX512_VNNI
+ ;-> ECX Bit12: AVX512_BITALG
+ ; ECX Bit13: TME_EN.
+ ;-> ECX Bit14: AVX512_VPOPCNTDQ
+ ; ECX Bit15: Reserved
+ ; ECX Bit16: LA57. Supports 57-bit linear addresses and five-level paging if 1.
+ ; ECX Bits 21-17: The value of MAWAU used by the BNDLDX and BNDSTX instructions in 64-bit mode.
+ ; ECX Bit22: RDPID and IA32_TSC_AUX are available if 1.
+ ; ECX Bit23: KL. Supports Key Locker if 1.
+ ; ECX Bit24: Reserved
+ ; ECX Bit25: CLDEMOTE. Supports cache line demote if 1.
+ ; ECX Bit26: Reserved
+ ; ECX Bit27: MOVDIRI. Supports MOVDIRI if 1.
+ ; ECX Bit28: MOVDIR64B. Supports MOVDIR64B if 1.
+ ; ECX Bit29: ENQCMD: Supports Enqueue Stores if 1.
+ ; ECX Bit30: SGX_LC. Supports SGX Launch Configuration if 1.
+ ; ECX Bit31: PKS. Supports protection keys for supervisor-mode pages if 1.
+
+; EDX bits
+ ; EDX Bits 01-00: Reserved
+ ; EDX Bit02: AVX512_4VNNIW (Intel® Xeon Phi™ only.)
+ ; EDX Bit03: AVX512_4FMAPS (Intel® Xeon Phi™ only.)
+ ; EDX Bit04: Fast Short REP MOV
+ ; EDX Bit05: UINTR. If 1, the processor supports user interrupts.
+ ; EDX Bits 07-06: Reserved
+ ;-> EDX Bit08: AVX512_VP2INTERSECT
+ ; EDX Bit09: Reserved
+ ; EDX Bit10: MD_CLEAR supported.
+ ; EDX Bits 13-11: Reserved
+ ; EDX Bit14: SERIALIZE
+ ; EDX Bit15: Hybrid. If 1, the processor is identified as a hybrid part.
+ ; EDX Bit16: TSXLDTRK. If 1, the processor supports Intel TSX suspend load address tracking.
+ ; EDX Bit17: Reserved
+ ; EDX Bit18: PCONFIG
+ ; EDX Bit19: Reserved
+ ; EDX Bit20: CET_IBT. Supports CET indirect branch tracking features if 1.
+ ; EDX Bit21: Reserved
+ ;-> EDX Bit22: AMX-BF16. If 1, the processor supports tile computational operations on bfloat16 numbers.
+ ;-> EDX Bit23: AVX512_FP16
+ ;-> EDX Bit24: AMX-TILE. If 1, the processor supports tile architecture
+ ;-> EDX Bit25: AMX-INT8. If 1, the processor supports tile computational operations on 8-bit integers.
+ ; EDX Bit26: Enumerates support for indirect branch restricted speculation (IBRS) and the indirect branch predictor barrier (IBPB).
+ ; EDX Bit27: Enumerates support for single thread indirect branch predictors (STIBP).
+ ; EDX Bit29: Enumerates support for the IA32_ARCH_CAPABILITIES MSR.
+ ; EDX Bit30: Enumerates support for the IA32_CORE_CAPABILITIES MSR.
+ ; EDX Bit31: Enumerates support for Speculative Store Bypass Disable (SSBD).
+
+
+
+%define STACK_ADJ 0x28+6*8
+
+get_instr_info:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+0*8],rbx
+ mov [rsp+1*8],rcx
+ mov [rsp+2*8],rdx
+ mov [rsp+3*8],r8
+ mov [rsp+4*8],r9
+ mov [rsp+5*8],r15
+
+
+ ; NOTE: the upper bits 32-63 of the corresponding 64bit register are zeroed on 32bit movs!
+ ; so it is easy to adapt the scheme to more CPU-features occupying the upper 32 bits
+ mov r9d,CPU_IS_ARCH_X86_64
+
+;LEAF_01H
+ mov eax,0x01
+ cpuid
+
+;ECX
+ mov r8d,ecx
+ and r8d,1<<0
+ shl r8d,CPU_HAS_SSE3-0
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<9
+ shl r8d,CPU_HAS_SSSE3-9
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<19
+ shr r8d,19-CPU_HAS_SSE4_1
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<20
+ shr r8d,20-CPU_HAS_SSE4_2
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<28
+ shr r8d,28-CPU_HAS_AVX
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<29
+ shr r8d,29-CPU_HAS_F16C
+ or r9d,r8d
+
+
+;EDX
+ mov r8d,edx
+ and r8d,1<<26
+ shr r8d,26-CPU_HAS_SSE2
+ or r9d,r8d
+
+
+;LEAF_07H
+ mov eax,0x07
+ mov ecx,0
+ cpuid
+
+;EBX
+ mov r8d,ebx
+ and r8d,1<<5
+ shl r8d,CPU_HAS_AVX2-5
+ or r9d,r8d
+
+ mov r8d,ebx
+ and r8d,1<<16
+ shl r8d,CPU_HAS_AVX512_F-16
+ or r9d,r8d
+
+ mov r8d,ebx
+ and r8d,1<<17
+ shl r8d,CPU_HAS_AVX512_DQ-17
+ or r9d,r8d
+
+ mov r8d,ebx
+ and r8d,1<<21
+ shr r8d,21-CPU_HAS_AVX512_IFMA
+ or r9d,r8d
+
+ mov r8d,ebx
+ and r8d,1<<30
+ shr r8d,30-CPU_HAS_AVX512_BW
+ or r9d,r8d
+
+ mov r8d,ebx
+ and r8d,1<<31
+ shr r8d,31-CPU_HAS_AVX512_VL
+ or r9d,r8d
+
+;ECX
+ mov r8d,ecx
+ and r8d,1<<1
+ shl r8d,CPU_HAS_AVX512_VBMI-1
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<6
+ shl r8d,CPU_HAS_AVX512_VBMI2-6
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<11
+ shl r8d,CPU_HAS_AVX512_VNNI-11
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<12
+ shl r8d,CPU_HAS_AVX512_BITALG-12
+ or r9d,r8d
+
+ mov r8d,ecx
+ and r8d,1<<14
+ shl r8d,CPU_HAS_AVX512_VPOPCNTDQ-14
+ or r9d,r8d
+
+;EDX
+ mov r8d,edx
+ and r8d,1<<8
+ shl r8d,CPU_HAS_AVX512_VP2INTERSECT-8
+ or r9d,r8d
+
+ mov r8d,edx
+ and r8d,1<<22
+ shl r8d,CPU_HAS_AMX_BF16-22
+ or r9d,r8d
+
+ mov r8d,edx
+ and r8d,1<<23
+ shl r8d,CPU_HAS_AVX512_FP16-23
+ or r9d,r8d
+
+ mov r8d,edx
+ and r8d,1<<24
+ shl r8d,CPU_HAS_AMX_TILE-24
+ or r9d,r8d
+
+ mov r8d,edx
+ and r8,1<<25
+ shl r8,CPU_HAS_AMX_INT8-25
+ or r9,r8
+
+; example for CPU_HAS_property_GT_31
+; mov r8d,edx
+; and r8,1<<26
+; shl r8,CPU_HAS_PROPERTY_GT_31-26
+; or r9,r8
+
+
+
+ mov rax,r9
+
+ mov rbx,[rsp+0*8]
+ mov rcx,[rsp+1*8]
+ mov rdx,[rsp+2*8]
+ mov r8 ,[rsp+3*8]
+ mov r9 ,[rsp+4*8]
+ mov r15,[rsp+5*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+;----------------------------------------------------------------------------------------------
+%endif
diff --git a/postgresql-15devel_orig/src/backend/utils/adt/encode.c b/postgresql-15devel/src/backend/utils/adt/encode.c
index 6dd93f9..7c37989 100644
--- a/postgresql-15devel_orig/src/backend/utils/adt/encode.c
+++ b/postgresql-15devel/src/backend/utils/adt/encode.c
@@ -19,6 +19,7 @@
#include "utils/builtins.h"
#include "utils/memutils.h"
+#define ALGORITHM_ID_HEX_ENCODE 0
/*
* Encoding conversion API.
@@ -39,6 +40,28 @@ struct pg_encoding
static const struct pg_encoding *pg_find_encoding(const char *name);
+
+ /* TODO BEGIN of block which should be moved to global initialization */
+
+static int64 cpu_capabilities_unmasked = -1;
+static int64 cpu_capabilities = -1;
+static int64 cpu_capabilities_mask = -1;
+static int64 algorithm_disable_mask = 0;
+
+static int64 valid_impl_id_arr[64];
+
+extern size_t apply_cpu_capabilities (int64 *capabilities, int64 *impl_id_arr, int64 mask);
+
+
+extern size_t get_hex_encode_alloc_addon (size_t srclen, int64 impl_id);
+
+extern size_t get_hex_decode_alloc_addon (size_t srclen, int64 impl_id);
+
+extern size_t hex_encode_fast (char *dst, const char *src, size_t srclen, int64 impl_id);
+
+ /* END init */
+
+
/*
* SQL functions.
*/
@@ -161,14 +184,46 @@ uint64
hex_encode(const char *src, size_t len, char *dst)
{
const char *end = src + len;
+ size_t n_done = 0;
+ size_t resultlen;
+ size_t len_reduce = 256;
- while (src < end)
+ /* TODO BEGIN of block which should be moved to global initialization */
+
+ /*
+ * Check the CPU capabilities only once.
+ * we call it through hex_enc_len in case this has not been called before
+ */
+ if (cpu_capabilities < 0)
{
- *dst++ = hextbl[(*src >> 4) & 0xF];
- *dst++ = hextbl[*src & 0xF];
- src++;
+ resultlen = hex_enc_len(src,len);
+// len_reduce = (resultlen>>1)-len;
+// elog(NOTICE,"ppast_hex_enc_len len_reduce %d bytes",len_reduce);
+ }
+
+ /* END init */
+
+#if defined(__x86_64__) || defined(_M_AMD64)
+ if (len >= 512)
+ {
+ n_done = hex_encode_fast(dst,src,len-len_reduce,valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE]);
}
- return (uint64) len * 2;
+#endif
+
+ if (n_done < len)
+ {
+ src += n_done;
+ dst += n_done<<1;
+ while (src < end)
+ {
+ *dst++ = hextbl[(*src >> 4) & 0xF];
+ *dst++ = hextbl[*src & 0xF];
+ src++;
+ }
+ }
+// elog(NOTICE,"post_hex_encode return double_len %d bytes",len<<1);
+
+ return (uint64) len << 1;
}
static inline char
@@ -223,13 +278,38 @@ hex_decode(const char *src, size_t len, char *dst)
return p - dst;
}
-static uint64
+uint64
hex_enc_len(const char *src, size_t srclen)
{
- return (uint64) srclen << 1;
+ /* TODO BEGIN of block which should be moved to global initialization */
+
+ /*
+ * Check the CPU capabilities only once.
+ * When cpu_capabilities is not set (is < 0) we query the architecture-
+ * dependent instruction information.
+ * An architecture supported for ASM/SIMD acceleration returns a positive
+ * value, for all other (not yet) supported architectures we set it to 0.
+ */
+#if defined(__x86_64__) || defined(_M_AMD64)
+ if (cpu_capabilities < 0)
+ {
+ apply_cpu_capabilities(&cpu_capabilities_unmasked, valid_impl_id_arr, algorithm_disable_mask);
+ cpu_capabilities = cpu_capabilities_unmasked&cpu_capabilities_mask;
+ elog(NOTICE,"post_apply cpu_capabilities = %ld",cpu_capabilities);
+ elog(NOTICE,"post_apply valid_impl_id_0 = %ld",valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE]);
+
+ }
+#else
+ cpu_capabilities = 0;
+#endif
+
+ /* END init */
+
+ return (uint64) (srclen << 1) +
+ get_hex_encode_alloc_addon(srclen,valid_impl_id_arr[ALGORITHM_ID_HEX_ENCODE]);
}
-static uint64
+uint64
hex_dec_len(const char *src, size_t srclen)
{
return (uint64) srclen >> 1;
diff --git a/postgresql-15devel/src/backend/utils/adt/hex_x86_64.asm b/postgresql-15devel/src/backend/utils/adt/hex_x86_64.asm
new file mode 100644
index 0000000..c2fd0c6
--- /dev/null
+++ b/postgresql-15devel/src/backend/utils/adt/hex_x86_64.asm
@@ -0,0 +1,2915 @@
+%ifdef __NASM_MAJOR__
+%ifdef COMPILE_C_STYLE_COMMENTS
+/*-------------------------------------------------------------------------
+ *
+ * hex_x86_64.asm
+ * Assembler routines for converting a buffer to hex (hex_encode_xxx)
+ * and restore the binary from hex code (hex_decode_xxx) on Intel X64
+ *
+ * Copyright (c) 2021-2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/utils/adt/hex_x86_64.asm
+ *
+ *-------------------------------------------------------------------------
+ */
+%endif
+
+
+
+
+
+
+; hex_x86_64.asm
+; Assembler routines for converting a buffer to hex (hex_encode_xxx)
+; and restore the binary from hex_code (hex_decode_xxx) on Intel X64
+
+
+; nasm -f WIN64 -g hex_x86_64.asm -l hex_x86_64.lis
+
+; golink /console hexdump.obj hex_x86_64.obj base64_x86_64.obj /files
+
+; Linux register order: %rdi, %rsi, %rdx, %rcx, %r8 and %r9
+; Windows register order: rcx, rdx, r8, r9
+
+; Windows non volatile registers: rbx,rbp,rdi,rsi,rsp, r12,r13,r14,r15 and xmm6-xmm15
+; Linux non volatile registers: rbx,rbp, rsp, r12,r13,r14,r15
+
+; nasm -f elf64 -g hex_x86_64.asm -l hex_x86_64_elf64.lis
+
+
+%ifidn __OUTPUT_FORMAT__, win64
+%define __WIN__ 1
+%elifidn __OUTPUT_FORMAT__, elf64
+%define __ELF__ 1
+%endif
+
+%define NSHIFT_ADDRESS_TO_PAGE 12
+
+%define N_BYTES_PER_SSE2 16
+%define N_BYTES_PER_AVX2 32
+%define N_BYTES_PER_AVX512 64
+
+global get_hex_encode_alloc_addon
+global get_hex_decode_alloc_addon
+
+global hex_encode_fast
+
+global hex_encode_sse2
+global hex_encode_ssse3
+global hex_encode_avx2
+global hex_encode_avx512bw
+
+global hex_decode_sse2
+global hex_decode_avx2
+global hex_decode_avx512bw
+
+
+default rel
+
+section .rdata align=64
+
+; values loaded with VMOVDQA64 in AVX512, so 64 bytes needed
+
+%define VPERM_AVX2_OFFS 0b11_01_10_00
+
+VPERM_ENCODE_OFFSETS dq 0,4,1,5,2,6,3,7
+VPERM_DECODE_OFFSETS dq 0,2,4,6,1,3,5,7
+
+ENCODE_SHUFFLE_TO_HEX times 4 db '0123456789abcdef'
+
+ENCODE_SHUFFLE_TO_HIGH_LOW times 4 db 8,0,9,1, 10,2,11,3, 12,4,13,5, 14,6,15,7
+
+
+; from here on values used with VPBROADCASTQ in AVX512 / VMOVDQA in AVX2, so only 16/32 bytes needed
+
+;BITMASK_UPPER_HALF times 32 db 0b1111_0000
+BITMASK_LOWER_HALF times 32 db 0b0000_1111
+
+BITMASK_NIBBLE_3_IN_WORD times 16 dw 0x0F00
+
+BITMASK_LITTLE_TO_BIG_ASCII times 32 db 0b1101_1111
+BITMASK_BIG_TO_LITTLE_ASCII times 32 db 0b0010_0000
+
+BITMASK_ZERO_ONE times 32 db 0b0101_0101
+
+BITMASK_ONE_ZERO times 32 db 0b1010_1010
+
+BITMASK_SELECT_DIGIT times 32 db 0b0011_1111
+
+ALL_BYTES_9 times 32 db 9
+
+ASCII_LITTLE_A_ADD:
+ALL_BYTES_39 times 32 db 39
+
+ASCII_0_OFFSET:
+ALL_BYTES_48 times 32 db 48
+
+;ASCII_DIGIT_9 times 32 db 48+9
+
+ASCII_LETTER_LITTLE_A times 32 db 'a'
+ASCII_LETTER_LITTLE_F times 32 db 'f'
+
+HEX_ENCODE_ARRAYS:
+HEX_ENC_MIN_SRC_LEN_ARR:
+ dq 0
+ dq 128
+ dq 512
+ dq 512
+ dq 1024
+
+HEX_ENC_RESULT_BUFFER_OVERFLOW_ARR:
+ dq 0
+ dq 64
+ dq 128
+ dq 128
+ dq 256
+
+HEX_ENC_IMPL_ROUTINE_ARR:
+ dq 0
+ dq hex_encode_sse2
+ dq hex_encode_ssse3
+ dq hex_encode_avx2
+ dq hex_encode_avx512bw
+
+
+HEX_DECODE_ARRAYS:
+HEX_DEC_MIN_SRC_LEN_ARR:
+ dq 0
+ dq 128
+ dq 512
+ dq 1024
+
+HEX_DEC_RESULT_BUFFER_OVERFLOW_ARR:
+ dq 0
+ dq 64
+ dq 128
+ dq 256
+
+HEX_DEC_IMPL_ROUTINE_ARR:
+ dq 0
+ dq hex_decode_sse2
+ dq hex_decode_avx2
+ dq hex_decode_avx512bw
+
+
+
+section .text align=32
+
+
+%use smartalign
+
+ ALIGNMODE p6
+
+%ifdef __WIN__
+%define STACK_FOR_XMM 10*16
+%else
+%define STACK_FOR_XMM 0
+%endif
+
+;----------------------------------------------------------------------------------------------
+
+; get_hex_encode_alloc_addon returns the tail-handling-required allocation addon
+; according to the request length and the maximum valid impl_id
+; it looks for the correct values in the hex_enc_tables indexed by impl_id
+
+get_hex_encode_alloc_addon:
+
+ sub rsp,0x28
+
+%ifdef __WIN__
+; mov rcx,rcx ; WIN parameter 1 requested source len
+; mov rdx,rdx ; WIN parameter 2 maximum valid impl_id
+%else
+ mov rcx,rdi ; LINUX parameter 1 requested source len
+ mov rdx,rsi ; LINUX parameter 2 maximum valid impl_id
+%endif
+
+ lea r8,[HEX_ENC_MIN_SRC_LEN_ARR]
+.loop_search:
+ cmp rcx,[r8+8*rdx] ; compare requested length with current impl_id
+ jge .offset_found
+ sub rdx,1 ; lower impl_id
+ jnz .loop_search
+.offset_found:
+ lea r8,[HEX_ENC_RESULT_BUFFER_OVERFLOW_ARR]
+ mov rax,[r8+8*rdx] ; return the alloc_overflow
+
+ add rsp,0x28
+
+ ret
+
+
+;----------------------------------------------------------------------------------------------
+
+; get_hex_decode_alloc_addon returns the tail-handling-required allocation addon
+; according to the request length and the maximum valid impl_id
+; It looks for the correct values in the hex_enc_tables indexed by impl_id
+
+get_hex_decode_alloc_addon:
+
+ sub rsp,0x28
+
+%ifdef __WIN__
+; mov rcx,rcx ; WIN parameter 1 requested source len
+; mov rdx,rdx ; WIN parameter 2 maximum valid impl_id
+%else
+ mov rcx,rdi ; LINUX parameter 1 requested source len
+ mov rdx,rsi ; LINUX parameter 2 maximum valid impl_id
+%endif
+
+ lea r8,[HEX_DEC_MIN_SRC_LEN_ARR]
+.loop_search:
+ cmp rcx,[r8+8*rdx] ; compare requested length with current impl_id
+ jge .offset_found
+ sub rdx,1 ; lower impl_id
+ jnz .loop_search
+.offset_found:
+ lea r8,[HEX_DEC_RESULT_BUFFER_OVERFLOW_ARR]
+ mov rax,[r8+8*rdx] ; return the alloc_overflow
+
+ add rsp,0x28
+
+ ret
+
+
+
+
+;----------------------------------------------------------------------------------------------
+
+; hex_encode_fast is the dispatcher routine according to the cpu capabilities and
+; the length of the encode request.
+;
+; Parameter 4 (moved to r15) is the maximum valid impl_id fulfilling the cpu requirements
+; (determined at program initialization time outside this routine)
+; The index into the HEX_ENCODE_ARRAYS is set to the maximum supported requirements.
+; When r15 == 0 no fast encode is supported and a zero length is returned.
+
+%define STACK_ADJ 0x28+2*8
+
+hex_encode_fast:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+0*8],r9
+ mov [rsp+1*8],r15
+
+
+ ; r15 = checked highest valid index
+%ifdef __WIN__
+ mov rax,r8 ; WIN parameter 3 number of elements
+ mov r15,r9 ; WIN parameter 4 maximum valid impl_id
+%else
+ mov rax,rdx ; LINUX parameter 3 number of elements
+ mov r15,rcx ; LINUX parameter 4 maximum valid impl_id
+%endif
+
+ lea r10,[HEX_ENC_MIN_SRC_LEN_ARR]
+
+.check_length:
+ cmp rax,[r10+8*r15]
+ jge .max_length_found
+ sub r15,1
+ jnz .check_length
+
+.max_length_found:
+ xor rax,rax
+ cmp r15,0
+ jz .return
+
+ lea r10,[HEX_ENC_IMPL_ROUTINE_ARR]
+ call [r10+8*r15]
+
+.return:
+ mov r9,[rsp+0*8]
+ mov r15,[rsp+1*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+
+
+%define STACK_ADJ 0x28+6*8+STACK_FOR_XMM
+
+
+
+;----------------------------------------------------------------------------------------------
+
+
+; xmm15 ; CONST ALL bytes 9
+; xmm14 ; CONST BITMASK_LOWER_HALF
+; xmm13 ; CONST ASCII_0_OFFSET
+; xmm12 ; CONST ASCII_LITTLE_A_ADD
+; xmm11 ; Prefetch Input line 3
+; xmm10 ; Prefetch Input line 2
+; xmm9 ; Input Line 1
+; xmm8 ; Input Line 0
+
+; xmm7 ; Unpack RL1 Rght Half low bits secnd line
+; xmm6 ; Unpack RH1 Rght Half high bits secnd line
+; xmm5 ; Unpack LL1 Left Half low bits secnd line
+; xmm4 ; Unpack LH1 Left Half high bits secnd line
+; xmm3 ; Unpack RL0 Rght Half low bits first line
+; xmm2 ; Unpack RH0 Rght Half high bits first line
+; xmm1 ; Unpack LL0 Left Half low bits first line
+; xmm0 ; Unpack LH0 Left Half high bits first line
+
+
+
+%define NINP_BYTES_PER_ROUND 2*16
+%define NINP_BITSHIFT 5
+
+hex_encode_sse2:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+STACK_FOR_XMM+0*8],rdi
+ mov [rsp+STACK_FOR_XMM+1*8],rsi
+ mov [rsp+STACK_FOR_XMM+2*8],r12
+ mov [rsp+STACK_FOR_XMM+3*8],r14
+ mov [rsp+STACK_FOR_XMM+4*8],r15
+
+%ifdef __WIN__
+
+ MOVDQA [rsp ],xmm6
+ MOVDQA [rsp+1*16],xmm7
+ MOVDQA [rsp+2*16],xmm8
+ MOVDQA [rsp+3*16],xmm9
+ MOVDQA [rsp+4*16],xmm10
+ MOVDQA [rsp+5*16],xmm11
+ MOVDQA [rsp+6*16],xmm12
+ MOVDQA [rsp+7*16],xmm13
+ MOVDQA [rsp+8*16],xmm14
+ MOVDQA [rsp+9*16],xmm15
+
+ mov rdi,rcx ; parameter 1 output buffer
+
+ mov rsi,rdx ; parameter 2 input buffer
+
+ mov rdx,r8 ; parameter 3 number of elements
+
+%endif
+
+
+;; initializer for QQ0 and QQ1
+
+ MOVDQU xmm8,[rsi+0*16] ; QQ0 p__23__ p__23__ l8 QL0
+ MOVDQU xmm9,[rsi+1*16] ; QQ1 p__23__ p__23__ l8 QL0
+
+;; initialize constants
+
+ MOVDQA xmm15,[ALL_BYTES_9] ; p_23__ l3
+
+ MOVDQA xmm14,[BITMASK_NIBBLE_3_IN_WORD] ; p_23__ l3
+
+; MOVDQA xmm13,[ALL_BYTES_48] ; p_23__ l3
+
+ MOVDQA xmm12,[ALL_BYTES_39] ; p_23__ l3
+
+
+ MOVDQA xmm13,xmm12
+ PADDB xmm13,xmm15 ; 48 = 39+9
+
+
+;; do page overshoot checks
+
+ mov rax,NINP_BYTES_PER_ROUND
+
+
+ mov r9,rdx ; exact requested number of elements to process
+ add r9,rsi ; r9 last valid pointer +1 of requested input buffer
+
+ mov r10,rsi ; r10 saved start of input buffer
+ mov r12,r9 ; r12 save of end of input buffer+1
+
+ lea rcx,[rsi+rdx-1] ; rcx address of last byte requested to read
+
+ lea r8,[rdx+NINP_BYTES_PER_ROUND-1]
+ shr r8,NINP_BITSHIFT ; number of loops
+ shl r8,NINP_BITSHIFT
+ add r8,rsi ; r8 address of last byte+1 read in complete loops
+ add r8,NINP_BYTES_PER_ROUND-1 ; r8 address of last byte read in normal loop with overshoot
+
+ mov r11,r8
+
+; DISABLED for NO OVERSHOOT
+; add r11,rax ; r11 address of last byte of prefetched data
+
+ shr rcx,NSHIFT_ADDRESS_TO_PAGE ; rcx page number of last byte requested input
+ shr r8,NSHIFT_ADDRESS_TO_PAGE ; r8 page number of last byte read after normal round
+
+ cmp rcx,r8 ; stay on same page
+ je .LSAME_PAGE_IN_ROUND
+ sub rdx,rax ; don't overshoot in reading: do one round less
+
+.LSAME_PAGE_IN_ROUND:
+ shr r11,NSHIFT_ADDRESS_TO_PAGE ; r11 page number of byte after prefetched data
+ cmp rcx,r11
+ je .LSAME_PAGE_IN_PREFETCH
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+
+.LSAME_PAGE_IN_PREFETCH:
+ add rdx,rsi ; rdx last valid pointer+1 for normal loop
+
+ ; due to prefetch add one round to end checks
+ add rdx,rax
+ add r9,rax
+
+ mov r11,rdi ; r11 saved start of output buffer
+
+ mov rcx,NINP_BYTES_PER_ROUND<<1 ; increment of output buffer for each round
+
+;; start preprocessing before loop
+
+ add rsi,rax ; add the number of processed array elements
+
+ align 32
+
+; ;IACA START_MARKER
+; mov ebx, 111
+; db 0x64, 0x67, 0x90
+
+.LHEXENCODE_LOOP:
+
+ MOVDQA xmm6,xmm8
+ PUNPCKLBW xmm6,xmm9 ; QL0 p____5 l1 QQ0 [Lin0_RghtH] [00 HL_0 00 HL_1 ...]
+
+
+ MOVDQA xmm7,xmm8
+ PUNPCKHBW xmm7,xmm9 ; QL0 p____5 l1 QQ0 [Lin0_LeftH] [00 HL_0 00 HL_1 ...]
+
+ MOVDQA xmm8,xmm10
+
+ MOVDQU xmm10,[rsi+0*16] ; QL0 p_____5 p1____5 l3+ QL0
+
+
+;;
+ MOVDQA xmm4,xmm6
+ PSRLW xmm4,12 ; RL2 shift RL2 Hx to lower byte in word
+ MOVDQA xmm5,xmm6
+ PAND xmm5,xmm14 ; RL2 mask nibble 3 in word (lower nibble shifted 8 bits left)
+
+ MOVDQA xmm9,xmm11
+
+ MOVDQU xmm11,[rsi+1*16] ; QL1 p_____5 p1____5 l3+ QL0
+
+ add rsi,rax ; add the number of processed array elements
+
+
+ PSLLW xmm6,8 ; RL0 rotate (shift) RL0 1 byte to left
+
+ MOVDQA xmm0,xmm6
+ PSRLW xmm0,4+8 ; RL0 shift RL0 Hx to lower byte in word
+ POR xmm4,xmm5 ; RL2 low nibble, high nibble at correct position (0L0H)
+
+
+
+ MOVDQA xmm1,xmm6
+ PAND xmm1,xmm14 ; RL0 mask nibble 3 in word (lower nibble shifted 8 bits left)
+
+ MOVDQA xmm6,xmm7
+ PSLLW xmm6,8 ; RL1 rotate (shift) RL1 1 byte to left
+
+
+ MOVDQA xmm5,xmm4
+ PCMPGTB xmm5,xmm15 ; RL2 all letters set to 0xFF, all digits to 0
+
+
+ POR xmm0,xmm1 ; RL0 low nibble, high nibble at correct position (0L0H)
+
+ PADDB xmm4,xmm13 ; RL2 add ASCII_0_OFFSET to all bytes, digits OK, letters yet missing +39
+
+
+ MOVDQA xmm2,xmm6
+ PSRLW xmm2,4+8 ; RL1 shift RL1 Hx to lower byte in word
+ MOVDQA xmm3,xmm6
+ PAND xmm3,xmm14 ; RL1 mask nibble 3 in word (lower nibble shifted 8 bits left)
+
+ MOVDQA xmm1,xmm0
+ PCMPGTB xmm1,xmm15 ; RL0 all letters set to 0xFF, all digits to 0
+
+
+ PAND xmm5,xmm12 ; RL2 for all letters set to 39, else 0 (
+ ; RL2 example for 102 (letter f) - 48 (bit45) - 15 (value) = 39 (amount to add for letters)
+
+ POR xmm2,xmm3 ; RL1 low nibble, high nibble at correct position (0L0H)
+
+ PAND xmm1,xmm12 ; RL0 for all letters set to 39, else 0 (
+ ; RL0 example for 102 (letter f) - 48 (bit45) - 15 (value) = 39 (amount to add for letters)
+ PADDB xmm4,xmm5 ; RL2 final result line RL2
+
+ PADDB xmm0,xmm13 ; RL0 add ASCII_0_OFFSET to all bytes, digits OK, letters yet missing +39
+
+ MOVDQA xmm3,xmm2
+ PCMPGTB xmm3,xmm15 ; RL1 all letters set to 0xFF, all digits to 0
+
+ PADDB xmm2,xmm13 ; RL1 add ASCII_0_OFFSET to all bytes, digits OK, letters yet missing +39
+
+
+ PADDB xmm0,xmm1 ; RL0 final result line RL0
+
+ MOVDQA xmm1,xmm7
+ PSRLW xmm1,12 ; RL3 shift RL3 Hx to lower byte in word
+
+ PAND xmm3,xmm12 ; RL1 for all letters set to 39, else 0 (
+ ; RL1 example for 102 (letter f) - 48 (bit45) - 15 (value) = 39 (amount to add for letters)
+ PADDB xmm2,xmm3 ; RL1 final result line RL1
+
+ MOVDQU [rdi+0*16],xmm0 ; RL0 RL0 p____4_ p____4_ l1 [Lin0 Left Half] Store Hexdump
+
+ PAND xmm7,xmm14 ; RL3 mask nibble 3 in word (lower nibble shifted 8 bits left)
+
+ MOVDQA xmm6,xmm7
+ POR xmm6,xmm1 ; RL3 low nibble, high nibble at correct position (0L0H)
+
+
+ MOVDQU [rdi+1*16],xmm2 ; RL1 RL1 p____4_ p____4_ l1 [Lin0 Left Half] Store Hexdump
+
+
+ MOVDQA xmm7,xmm6
+ PCMPGTB xmm7,xmm15 ; RL3 all letters set to 0xFF, all digits to 0
+
+ PADDB xmm6,xmm13 ; RL3 add ASCII_0_OFFSET to all bytes, digits OK, letters yet missing +39
+
+ PAND xmm7,xmm12 ; RL3 for all letters set to 39, else 0 (
+ ; RL3 example for 102 (letter f) - 48 (bit45) - 15 (value) = 39 (amount to add for letters)
+ MOVDQU [rdi+2*16],xmm4 ; RL2 RL2 p____4_ p____4_ l1 [Lin0 Left Half] Store Hexdump
+
+
+ PADDB xmm6,xmm7 ; RL3 final result line RL2
+
+ MOVDQU [rdi+3*16],xmm6 ; RL3 RL3 p____4_ p____4_ l1 [Lin0 Left Half] Store Hexdump
+
+
+ add rdi,rcx ; add the number of processed output bytes
+
+ cmp rsi,rdx ; compare incremented rsi (= pointer to next chunk) to last valid pointer +1
+ jl .LHEXENCODE_LOOP
+
+ ; end of normal loop reached
+ ; we can do one more round when original count has been reduced by one round
+ cmp rax,0
+ je .LFINISH_EXTRA
+
+ cmp rdx,r9 ; input buffer length was not reduced when equal
+ je .LFINISH_NORMAL
+
+ sub rsi,rax ; for prefetching the last round, load the last round again
+	sub	rdx,rax			; adapt end condition for last round also
+ xor rax,rax
+ jmp .LHEXENCODE_LOOP
+
+
+.LFINISH_EXTRA:
+ add rsi,NINP_BYTES_PER_ROUND ; add the extra round to get processed bytes
+ jmp .LFINISH
+
+.LFINISH_NORMAL:
+ sub rsi,NINP_BYTES_PER_ROUND ; sub the added prefetch round to get processed bytes
+
+.LFINISH:
+ ; r9 = address of requested input bytes+1
+ ; rsi = address of processed input bytes+1
+ ; now get the minimum of rdx,rsi to rax
+;; sub rsi,rax ; for last round do nothing (rax=0), else sub increment for one round
+;; sub r9,rax
+
+ mov rax,r12
+ cmp rsi,r12 ; get min from r12 (address of requested input) and rsi (address of done input)
+
+ jge .LCALC_PROCESSED_BYTES
+ mov rax,rsi ; rax=address of last valid input byte+1
+
+.LCALC_PROCESSED_BYTES:
+ sub rax,r10 ; sub the input buffer start address
+ ; rax = number of valid processed input bytes = return value
+
+ cmp rsi,rdx ; compare rdx (address of requested input) and rsi (address of done input)
+ je .LNO_ZERO_OUT
+
+ mov r15,rax ; number of elements to process
+
+ shl r15,1 ; number of output bytes
+
+ add r15,r11 ; pointer to next byte after full valid output buffer
+
+ PXOR xmm0,xmm0 ; all zero
+;ZERO MOVDQU [r15],xmm0 ; zero out one register width after last output
+
+.LNO_ZERO_OUT:
+
+%ifdef __WIN__
+
+ MOVDQA xmm6 ,[rsp ]
+ MOVDQA xmm7 ,[rsp+1*16]
+ MOVDQA xmm8 ,[rsp+2*16]
+ MOVDQA xmm9 ,[rsp+3*16]
+ MOVDQA xmm10,[rsp+4*16]
+ MOVDQA xmm11,[rsp+5*16]
+ MOVDQA xmm12,[rsp+6*16]
+ MOVDQA xmm13,[rsp+7*16]
+ MOVDQA xmm14,[rsp+8*16]
+ MOVDQA xmm15,[rsp+9*16]
+
+
+%endif
+
+ mov rdi,[rsp+STACK_FOR_XMM+0*8]
+ mov rsi,[rsp+STACK_FOR_XMM+1*8]
+ mov r12,[rsp+STACK_FOR_XMM+2*8]
+ mov r14,[rsp+STACK_FOR_XMM+3*8]
+ mov r15,[rsp+STACK_FOR_XMM+4*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+
+;----------------------------------------------------------------------------------------------
+
+
+
+
+; xmm15 ; CONST ALL bytes 9
+; xmm14	; MAX byte value of all lines (Init all to letter 'f' (0x66))
+; xmm13	; MIN byte value of non-ascii-digit values (not 3x) (Init all to letter 'a' (0x61))
+; xmm12	; ORed compare of all digit-values cmp > 9 (Init all zero)
+; xmm11 ; Input line 3
+; xmm10 ; Input line 2
+; xmm9 ; Input line 1
+; xmm8 ; Input Line 0
+
+; xmm7		; Unpack RL1 Rght Half low bits second line
+; xmm6		; Unpack RH1 Rght Half high bits second line
+; xmm5		; Unpack LL1 Left Half low bits second line
+; xmm4		; Unpack LH1 Left Half high bits second line
+; xmm3 ; Unpack RL0 Rght Half low bits first line
+; xmm2 ; Unpack RH0 Rght Half high bits first line
+; xmm1 ; Unpack LL0 Left Half low bits first line
+; xmm0 ; Unpack LH0 Left Half high bits first line
+
+
+%define NINP_BYTES_PER_ROUND 4*16
+%define NINP_BITSHIFT 6
+
+hex_decode_sse2:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+STACK_FOR_XMM+0*8],rdi
+ mov [rsp+STACK_FOR_XMM+1*8],rsi
+ mov [rsp+STACK_FOR_XMM+2*8],r12
+ mov [rsp+STACK_FOR_XMM+3*8],r14
+ mov [rsp+STACK_FOR_XMM+4*8],r15
+
+%ifdef __WIN__
+
+ MOVDQA [rsp ],xmm6
+ MOVDQA [rsp+1*16],xmm7
+ MOVDQA [rsp+2*16],xmm8
+ MOVDQA [rsp+3*16],xmm9
+ MOVDQA [rsp+4*16],xmm10
+ MOVDQA [rsp+5*16],xmm11
+ MOVDQA [rsp+6*16],xmm12
+ MOVDQA [rsp+7*16],xmm13
+ MOVDQA [rsp+8*16],xmm14
+ MOVDQA [rsp+9*16],xmm15
+
+ mov rdi,rcx ; parameter 1 output buffer
+
+ mov rsi,rdx ; parameter 2 input buffer
+
+ mov rdx,r8 ; parameter 3 number of elements
+
+%endif
+
+;; initializer for QQ0 and QQ1
+
+ MOVDQU xmm8,[rsi] ;
+ MOVDQU xmm9,[rsi+1*16] ;
+
+ MOVDQU xmm10,[rsi+2*16] ;
+ MOVDQU xmm11,[rsi+3*16] ;
+
+;; initialize constants
+
+ mov r15,[BITMASK_BIG_TO_LITTLE_ASCII]
+
+ MOVDQA xmm7,[BITMASK_LOWER_HALF]
+
+ MOVDQA xmm15,[ALL_BYTES_9] ; p_23__ l3
+
+ MOVDQA xmm14,[ASCII_LETTER_LITTLE_F] ; p_23__ l3
+
+ MOVDQA xmm13,[ASCII_LETTER_LITTLE_A] ; p_23__ l3
+
+ PXOR xmm12,xmm12 ; all zero
+
+ MOVQ xmm2,r15 ; 0b0010_0000
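+
+; Scalar sketch of the per-character decode step these constants prepare
+; (c = one input character, v = the resulting nibble value):
+;     c |= 0x20;                  /* fold 'A'-'F' onto 'a'-'f'           */
+;     v  = c & 0x0F;              /* exact for '0'-'9'                   */
+;     if (c is not an ASCII digit)
+;         v += 9;                 /* 'a' & 0x0F = 1  ->  1 + 9 = 10      */
+; Validity is only accumulated inside the loop: xmm14 collects the byte-wise
+; maximum of the case-folded input, xmm13 the minimum of all non-digit bytes,
+; and xmm12 ORs flags for "ASCII digit with low nibble > 9", so illegal
+; characters can be rejected after the loop.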
+
+
+;; do page overshoot checks
+;; due to end condition handling not done here, we only process full rounds
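+;;
+;; The loop always loads whole rounds of NINP_BYTES_PER_ROUND bytes, so the
+;; prefetching load can read past the requested input. This is tolerated only
+;; while the overshooting bytes stay on the same memory page as the last
+;; requested byte (page number = address >> NSHIFT_ADDRESS_TO_PAGE); otherwise
+;; one round less is executed here and the remaining bytes are left to the
+;; caller, which sees the number of actually processed input bytes in the
+;; return value (rax).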
+
+ mov rax,NINP_BYTES_PER_ROUND
+
+ add rdx,NINP_BYTES_PER_ROUND-1
+ shr rdx,NINP_BITSHIFT ;
+ shl rdx,NINP_BITSHIFT ; rdx number of bytes read in normal loop equiv to xxx full loops
+
+ mov r9,rdx ; exact requested number of elements to process
+ add r9,rsi ; r9 last valid pointer +1 of requested input buffer
+
+ mov r10,rsi ; r10 saved start of input buffer
+ mov r12,r9 ; r12 save of end of input buffer+1
+
+ lea rcx,[rsi+rdx-1] ; rcx address of last byte requested to read
+
+ mov r11,r9
+; DISABLED for NO OVERSHOOT
+; add r11,rax ; r11 address of last byte of prefetched data
+
+ shr rcx,NSHIFT_ADDRESS_TO_PAGE ; rcx page number of last byte requested input
+ shr r11,NSHIFT_ADDRESS_TO_PAGE ; r11 page number of byte after prefetched data
+ cmp rcx,r11
+ je .LSAME_PAGE_IN_PREFETCH
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+
+.LSAME_PAGE_IN_PREFETCH:
+ add rdx,rsi ; rdx last valid pointer+1 for normal loop
+
+ ; due to prefetch add one round to end checks
+ add rdx,rax
+ add r9,rax
+
+ mov r11,rdi ; r11 saved start of output buffer
+
+ mov rcx,NINP_BYTES_PER_ROUND>>1 ; increment of output buffer for each round
+
+;; start preprocessing before loop
+
+ PUNPCKLQDQ xmm2,xmm2 ; all bytes 0b0010_0000
+
+; PUNPCKLQDQ xmm7,xmm7 ; all bytes 0b0000_1111
+
+ MOVDQA xmm0,xmm2
+ MOVDQA xmm1,xmm2
+
+ MOVDQA xmm4,xmm2
+ MOVDQA xmm5,xmm2
+
+ add rsi,rax ; add the number of processed array elements
+
+ align 32
+
+; ;IACA START_MARKER
+; mov ebx, 111
+; db 0x64, 0x67, 0x90
+
+.LHEXDECODE_LOOP:
+
+ MOVDQA xmm6,xmm2
+
+
+ PSRAD xmm6,1 ; all bytes 0b0001_0000
+
+ POR xmm0,xmm8 ; line 0 all letters set to little ASCII a-f
+ POR xmm1,xmm9
+ POR xmm4,xmm10
+ POR xmm5,xmm11
+
+ PMAXUB xmm14,xmm0
+ PMAXUB xmm14,xmm1
+ PMAXUB xmm14,xmm4
+ PMAXUB xmm14,xmm5
+
+;max check finished
+
+ POR xmm0,xmm6 ; line 0 with bits for ASCII_0 set (Byte OR 0bxx11_xxxx)
+ POR xmm1,xmm6
+ POR xmm4,xmm6
+ POR xmm5,xmm6
+
+ PCMPEQD xmm6,xmm6 ; all ONE
+
+	PCMPEQB	xmm0,xmm8		; set to all ONE when ASCII digit (forced bit 0bxx11_xxxx equal to orig value)
+ PCMPEQB xmm1,xmm9
+ PCMPEQB xmm4,xmm10
+ PCMPEQB xmm5,xmm11
+
+;start min check line0+1
+ MOVDQA xmm2,xmm0 ; copy all one when digit
+ MOVDQA xmm3,xmm1
+
+ PANDN xmm2,xmm6 ; set to all one for values NOT digits
+ PANDN xmm3,xmm6
+
+	PAND	xmm2,xmm8		; set to orig value when NOT ASCII digit
+ PAND xmm3,xmm9
+
+ POR xmm2,xmm0 ; set all zero bytes to all one
+ POR xmm3,xmm1
+
+ PMINUB xmm13,xmm2
+ PMINUB xmm13,xmm3
+
+
+;start min check line2+3
+ MOVDQA xmm2,xmm4 ; copy all one when digit
+ MOVDQA xmm3,xmm5
+
+
+ PANDN xmm2,xmm6 ; set to all one for values NOT digits
+ PANDN xmm3,xmm6
+
+	PAND	xmm2,xmm10		; set to orig value when NOT ASCII digit
+ PAND xmm3,xmm11
+
+ POR xmm2,xmm4 ; set all zero bytes to all one
+ POR xmm3,xmm5
+
+ PMINUB xmm13,xmm2
+ PMINUB xmm13,xmm3
+
+
+; start legal digit check
+
+ MOVDQA xmm2,xmm0 ; copy all one when digit
+ MOVDQA xmm3,xmm1
+
+	PAND	xmm2,xmm8		; set to orig value when ASCII digit
+	PAND	xmm3,xmm9
+
+	PAND	xmm2,xmm7		; set to lower nibble value when ASCII digit
+	PAND	xmm3,xmm7
+
+	PCMPGTB	xmm2,xmm15		; set to all ONE when ASCII digit and value > 9
+ PCMPGTB xmm3,xmm15
+
+ POR xmm12,xmm2 ; accumulate illegal chars like ASCII digit and value > 9
+ POR xmm12,xmm3
+
+	PAND	xmm2,xmm10		; set to orig value when ASCII digit
+	PAND	xmm3,xmm11
+
+	PAND	xmm2,xmm7		; set to lower nibble value when ASCII digit
+	PAND	xmm3,xmm7
+
+
+	PCMPGTB	xmm2,xmm15		; set to all ONE when ASCII digit and value > 9
+	PCMPGTB	xmm3,xmm15		; set to all ONE when ASCII digit and value > 9
+
+ POR xmm12,xmm2
+ POR xmm12,xmm3
+
+
+
+;-- ; all checks accumulated, xmm2,xmm3,xmm6,xmm7 have lower nibbles of lines 0-3
+ PCMPEQD xmm6,xmm6 ; all ONE
+ PSRLW xmm6,8 ; QQ0 p01____ p01____ l1
+
+ MOVDQA xmm2,xmm7 ; all bytes 0b0000_1111
+ MOVDQA xmm3,xmm7
+
+ PAND xmm2,xmm8 ; all byte values only lower half (nibble) Line 0+1
+ MOVDQU xmm8,[rsi+0*16] ;
+ PAND xmm3,xmm9
+ MOVDQU xmm9,[rsi+1*16] ;
+
+ PANDN xmm0,xmm15 ; put 9 to every element not DIGIT
+ PANDN xmm1,xmm15
+
+ PADDB xmm2,xmm0 ; add 9 to every nibble not DIGIT
+ PADDB xmm3,xmm1
+
+ MOVDQA xmm0,xmm2
+ PSRLW xmm0,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ PSLLW xmm2,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ MOVDQA xmm1,xmm3
+ PSRLW xmm1,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ PSLLW xmm3,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ POR xmm0,xmm2 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+ POR xmm1,xmm3 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+
+ PAND xmm0,xmm6 ; line 0
+ PAND xmm1,xmm6 ; line 1
+
+ PACKUSWB xmm0,xmm1 ; QQ0 p_____5 p_1___5 l1 QQ0 [Lin0_LeftH] [HL_0 00 HL_1 00 ...]
+
+; line 0 and 1 processed
+
+
+ MOVDQA xmm2,xmm7 ; all bytes 0b0000_1111
+ MOVDQA xmm3,xmm7
+
+ PAND xmm2,xmm10 ; all byte values only lower half (nibble) Line 0+1
+ MOVDQU xmm10,[rsi+2*16] ;
+ PAND xmm3,xmm11
+ MOVDQU xmm11,[rsi+3*16] ;
+
+ PANDN xmm4,xmm15 ; put 9 to every element not DIGIT
+ PANDN xmm5,xmm15
+
+ PADDB xmm2,xmm4 ; add 9 to every nibble not DIGIT
+ PADDB xmm3,xmm5
+
+ add rsi,rax ; add the number of processed array elements
+
+	MOVDQU	[rdi+0*16],xmm0		; S0_ p____4_ p____4_ l1 [Lin0 Left Half] Store decoded bytes
+
+
+ MOVDQA xmm4,xmm2
+ PSRLW xmm4,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ PSLLW xmm2,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ MOVDQA xmm5,xmm3
+ PSRLW xmm5,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ PSLLW xmm3,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+
+ POR xmm4,xmm2 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+ POR xmm5,xmm3 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+
+ MOVQ xmm2,r15 ;
+ PUNPCKLQDQ xmm2,xmm2 ; all bytes 0b0010_0000
+
+ MOVDQA xmm0,xmm2
+ MOVDQA xmm1,xmm2
+
+; MOVQ xmm7,rcx ;
+
+ PAND xmm4,xmm6 ;
+ PAND xmm5,xmm6 ; line 1
+
+ PACKUSWB xmm4,xmm5 ; QQ0 p_____5 p_1___5 l1 QQ0 [Lin0_LeftH] [HL_0 00 HL_1 00 ...]
+
+; MOVDQA xmm1,xmm11
+
+
+	MOVDQU	[rdi+1*16],xmm4		; S0_ p____4_ p____4_ l1 [Lin0 Left Half] Store decoded bytes
+
+ MOVDQA xmm4,xmm2
+ MOVDQA xmm5,xmm2
+
+; PUNPCKLQDQ xmm7,xmm7 ; all bytes 0b0000_1111
+
+
+ add rdi,rcx ; add the number of processed output bytes
+
+ cmp rsi,rdx ; compare incremented rsi (= pointer to next chunk) to last valid pointer +1
+ jl .LHEXDECODE_LOOP
+
+ ; end of normal loop reached
+ ; we can do one more round when original count has been reduced by one round
+ cmp rax,0
+ je .LFINISH
+
+ cmp rdx,r9 ; input buffer length was not reduced when equal
+ je .LFINISH
+
+ sub rsi,rax ; for prefetching the last round, load the last round again
+	sub	rdx,rax			; adapt end condition for last round also
+ xor rax,rax
+ jmp .LHEXDECODE_LOOP
+
+.LFINISH:
+
+ mov rax,rdi
+ sub rax,r11 ; rax = number of output bytes
+ add rax,rax ; rax = number of valid processed input bytes = return value
+
+%ifdef __WIN__
+
+ MOVDQA xmm6 ,[rsp ]
+ MOVDQA xmm7 ,[rsp+1*16]
+ MOVDQA xmm8 ,[rsp+2*16]
+ MOVDQA xmm9 ,[rsp+3*16]
+ MOVDQA xmm10,[rsp+4*16]
+ MOVDQA xmm11,[rsp+5*16]
+ MOVDQA xmm12,[rsp+6*16]
+ MOVDQA xmm13,[rsp+7*16]
+ MOVDQA xmm14,[rsp+8*16]
+ MOVDQA xmm15,[rsp+9*16]
+
+%endif
+
+ mov rdi,[rsp+STACK_FOR_XMM+0*8]
+ mov rsi,[rsp+STACK_FOR_XMM+1*8]
+ mov r12,[rsp+STACK_FOR_XMM+2*8]
+ mov r14,[rsp+STACK_FOR_XMM+3*8]
+ mov r15,[rsp+STACK_FOR_XMM+4*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+
+
+;----------------------------------------------------------------------------------------------
+
+
+
+
+; ymm15 ; CONST ALL bytes 9
+; ymm14	; MAX byte value of all lines (Init all to letter 'f' (0x66))
+; ymm13	; MIN byte value of non-ascii-digit values (not 3x) (Init all to letter 'a' (0x61))
+; ymm12	; ORed compare of all digit-values cmp > 9 (Init all zero)
+; ymm11 ; Input line 3
+; ymm10 ; Input line 2
+; ymm9 ; Input line 1
+; ymm8 ; Input Line 0
+
+; ymm7		; CONST BITMASK_LOWER_HALF	;Unpack RL1 Rght Half low bits second line
+; ymm6		; Unpack RH1 Rght Half high bits second line
+; ymm5		; Unpack LL1 Left Half low bits second line
+; ymm4		; Unpack LH1 Left Half high bits second line
+; ymm3 ; Unpack RL0 Rght Half low bits first line
+; ymm2 ; Unpack RH0 Rght Half high bits first line
+; ymm1 ; Unpack LL0 Left Half low bits first line
+; ymm0 ; Unpack LH0 Left Half high bits first line
+
+
+%define NINP_BYTES_PER_ROUND 4*32
+%define NINP_BITSHIFT 7
+
+hex_decode_avx2:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+STACK_FOR_XMM+0*8],rdi
+ mov [rsp+STACK_FOR_XMM+1*8],rsi
+ mov [rsp+STACK_FOR_XMM+2*8],r12
+ mov [rsp+STACK_FOR_XMM+3*8],r14
+ mov [rsp+STACK_FOR_XMM+4*8],r15
+
+%ifdef __WIN__
+
+ VMOVDQA [rsp ],xmm6
+ VMOVDQA [rsp+1*16],xmm7
+ VMOVDQA [rsp+2*16],xmm8
+ VMOVDQA [rsp+3*16],xmm9
+ VMOVDQA [rsp+4*16],xmm10
+ VMOVDQA [rsp+5*16],xmm11
+ VMOVDQA [rsp+6*16],xmm12
+ VMOVDQA [rsp+7*16],xmm13
+ VMOVDQA [rsp+8*16],xmm14
+ VMOVDQA [rsp+9*16],xmm15
+
+ mov rdi,rcx ; parameter 1 output buffer
+
+ mov rsi,rdx ; parameter 2 input buffer
+
+ mov rdx,r8 ; parameter 3 number of elements
+
+%endif
+
+;; initializer for QQ0 and QQ1
+
+ VMOVDQU ymm8,[rsi+0*32] ;
+ VMOVDQU ymm9,[rsi+1*32] ;
+
+ VMOVDQU ymm10,[rsi+2*32] ;
+ VMOVDQU ymm11,[rsi+3*32] ;
+
+;; initialize constants
+
+ VMOVDQA ymm15,[ALL_BYTES_9] ; p_23__ l3
+
+ VMOVDQA ymm14,[ASCII_LETTER_LITTLE_F] ; p_23__ l3
+
+ VMOVDQA ymm13,[ASCII_LETTER_LITTLE_A] ; p_23__ l3
+
+ VMOVDQA ymm7,[BITMASK_LOWER_HALF] ; 0b0000_1111
+
+ VPXOR ymm12,ymm12 ; all zero
+
+ VMOVDQA ymm2,[BITMASK_BIG_TO_LITTLE_ASCII] ; 0b0010_0000
+
+
+
+;; do page overshoot checks
+;; due to end condition handling not done here, we only process full rounds
+
+ mov rax,NINP_BYTES_PER_ROUND
+
+ add rdx,NINP_BYTES_PER_ROUND-1
+ shr rdx,NINP_BITSHIFT ;
+ shl rdx,NINP_BITSHIFT ; rdx number of bytes read in normal loop equiv to xxx full loops
+
+ mov r9,rdx ; exact requested number of elements to process
+ add r9,rsi ; r9 last valid pointer +1 of requested input buffer
+
+ mov r10,rsi ; r10 saved start of input buffer
+ mov r12,r9 ; r12 save of end of input buffer+1
+
+ lea rcx,[rsi+rdx-1] ; rcx address of last byte requested to read
+
+ mov r11,r9
+; DISABLED for NO OVERSHOOT
+; add r11,rax ; r11 address of last byte of prefetched data
+
+ shr rcx,NSHIFT_ADDRESS_TO_PAGE ; rcx page number of last byte requested input
+ shr r11,NSHIFT_ADDRESS_TO_PAGE ; r11 page number of byte after prefetched data
+ cmp rcx,r11
+ je .LSAME_PAGE_IN_PREFETCH
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+
+.LSAME_PAGE_IN_PREFETCH:
+ add rdx,rsi ; rdx last valid pointer+1 for normal loop
+
+ ; due to prefetch add one round to end checks
+ add rdx,rax
+ add r9,rax
+
+ mov r11,rdi ; r11 saved start of output buffer
+
+ mov rcx,NINP_BYTES_PER_ROUND>>1 ; increment of output buffer for each round
+
+;; start preprocessing before loop
+
+ add rsi,rax ; add the number of processed array elements
+
+ align 32
+
+; ;IACA START_MARKER
+; mov ebx, 111
+; db 0x64, 0x67, 0x90
+
+.LHEXDECODE_LOOP:
+
+ VMOVDQA ymm6,ymm2
+
+
+ VPSRAD ymm6,1 ; all bytes 0b0001_0000
+
+ VPOR ymm0,ymm2,ymm8 ; line 0 all letters set to little ASCII a-f
+ VPOR ymm1,ymm2,ymm9
+ VPOR ymm4,ymm2,ymm10
+ VPOR ymm5,ymm2,ymm11
+
+ VPMAXUB ymm14,ymm0
+ VPMAXUB ymm14,ymm1
+ VPMAXUB ymm14,ymm4
+ VPMAXUB ymm14,ymm5
+
+;max check finished
+
+ VPOR ymm0,ymm6 ; line 0 with bits for ASCII_0 set (Byte OR 0bxx11_xxxx)
+ VPOR ymm1,ymm6
+ VPOR ymm4,ymm6
+ VPOR ymm5,ymm6
+
+ VPCMPEQD ymm6,ymm6 ; all ONE
+
+	VPCMPEQB	ymm0,ymm8	; set to all ONE when ASCII digit (forced bit 0bxx11_xxxx equal to orig value)
+ VPCMPEQB ymm1,ymm9
+ VPCMPEQB ymm4,ymm10
+ VPCMPEQB ymm5,ymm11
+
+;start min check line0+1
+ VPANDN ymm2,ymm0,ymm6 ; set to all one for values NOT digits
+ VPANDN ymm3,ymm1,ymm6
+
+	VPAND	ymm2,ymm8		; set to orig value when NOT ASCII digit
+ VPAND ymm3,ymm9
+
+ VPOR ymm2,ymm0 ; set all zero bytes to all one
+ VPOR ymm3,ymm1
+
+ VPMINUB ymm13,ymm2
+ VPMINUB ymm13,ymm3
+
+
+;start min check line2+3
+
+
+ VPANDN ymm2,ymm4,ymm6 ; set to all one for values NOT digits
+ VPANDN ymm3,ymm5,ymm6
+
+	VPAND	ymm2,ymm10		; set to orig value when NOT ASCII digit
+ VPAND ymm3,ymm11
+
+ VPOR ymm2,ymm4 ; set all zero bytes to all one
+ VPOR ymm3,ymm5
+
+ VPMINUB ymm13,ymm2
+ VPMINUB ymm13,ymm3
+
+
+; start legal digit check
+
+	VPAND	ymm2,ymm0,ymm8		; set to orig value when ASCII digit
+	VPAND	ymm3,ymm1,ymm9
+
+	VPAND	ymm2,ymm7		; set to lower nibble value when ASCII digit
+	VPAND	ymm3,ymm7
+
+	VPCMPGTB	ymm2,ymm15	; set to all ONE when ASCII digit and value > 9
+ VPCMPGTB ymm3,ymm15
+
+ VPOR ymm12,ymm2 ; accumulate illegal chars like ASCII digit and value > 9
+ VPOR ymm12,ymm3
+
+	VPAND	ymm2,ymm10		; set to orig value when ASCII digit
+	VPAND	ymm3,ymm11
+
+	VPAND	ymm2,ymm7		; set to lower nibble value when ASCII digit
+	VPAND	ymm3,ymm7
+
+
+	VPCMPGTB	ymm2,ymm15	; set to all ONE when ASCII digit and value > 9
+	VPCMPGTB	ymm3,ymm15	; set to all ONE when ASCII digit and value > 9
+
+ VPOR ymm12,ymm2
+ VPOR ymm12,ymm3
+
+; all (max, min and >9) checks finished
+
+
+;-- ; all checks accumulated, ymm2,ymm3,ymm6,ymm7 have lower nibbles of lines 0-3
+ VPCMPEQD ymm6,ymm6 ; all ONE
+ VPSRLW ymm6,8 ; QQ0 p01____ p01____ l1
+
+ VPAND ymm2,ymm7,ymm8 ; all byte values only lower half (nibble) Line 0+1
+ VMOVDQU ymm8,[rsi+0*32] ;
+ VPAND ymm3,ymm7,ymm9
+ VMOVDQU ymm9,[rsi+1*32] ;
+
+ VPANDN ymm0,ymm15 ; put 9 to every element not DIGIT
+ VPANDN ymm1,ymm15
+
+ VPADDB ymm2,ymm0 ; add 9 to every nibble not DIGIT
+ VPADDB ymm3,ymm1
+
+ VPSRLW ymm0,ymm2,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ VPSLLW ymm2,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ VPSRLW ymm1,ymm3,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ VPSLLW ymm3,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ VPOR ymm0,ymm2 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+ VPOR ymm1,ymm3 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+
+ VPAND ymm0,ymm6 ; line 0
+ VPAND ymm1,ymm6 ; line 1
+
+ VPACKUSWB ymm0,ymm1 ; QQ0 p_____5 p_1___5 l1 QQ0 [Lin0_LeftH] [HL_0 00 HL_1 00 ...]
+
+; line 0 and 1 processed
+
+
+ VPAND ymm2,ymm7,ymm10 ; all byte values only lower half (nibble) Line 0+1
+ VMOVDQU ymm10,[rsi+2*32] ;
+ VPAND ymm3,ymm7,ymm11
+ VMOVDQU ymm11,[rsi+3*32] ;
+
+ VPANDN ymm4,ymm15 ; put 9 to every element not DIGIT
+ VPANDN ymm5,ymm15
+
+ VPADDB ymm2,ymm4 ; add 9 to every nibble not DIGIT
+ VPADDB ymm3,ymm5
+
+ add rsi,rax ; add the number of processed array elements
+
+	VMOVDQU	[rdi+0*32],ymm0		; S0_ p____4_ p____4_ l1 [Lin0 Left Half] Store decoded bytes
+
+
+ VPSRLW ymm4,ymm2,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ VPSLLW ymm2,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ VPSRLW ymm5,ymm3,8 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+ VPSLLW ymm3,4 ; QQ0 p01____ p01____ l1 Q03 [Lin0_LeftH] [00 0H_0 00 0H_1 ...]
+
+
+ VPOR ymm4,ymm2 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+ VPOR ymm5,ymm3 ; QQ0 p01___5 p01___5 l1 QQ3 [Lin0_LeftH] [AH AL_0 AH AL_1 ...]
+
+ VMOVDQA ymm2,[BITMASK_BIG_TO_LITTLE_ASCII] ;
+
+
+ VPAND ymm4,ymm6 ;
+ VPAND ymm5,ymm6 ; line 1
+
+ VPACKUSWB ymm4,ymm5 ; QQ0 p_____5 p_1___5 l1 QQ0 [Lin0_LeftH] [HL_0 00 HL_1 00 ...]
+
+
+	VMOVDQU	[rdi+1*32],ymm4		; S0_ p____4_ p____4_ l1 [Lin0 Left Half] Store decoded bytes
+
+ add rdi,rcx ; add the number of processed output bytes
+
+
+
+ cmp rsi,rdx ; compare incremented rsi (= pointer to next chunk) to last valid pointer +1
+ jl .LHEXDECODE_LOOP
+
+ ; end of normal loop reached
+ ; we can do one more round when original count has been reduced by one round
+ cmp rax,0
+ je .LFINISH
+
+ cmp rdx,r9 ; input buffer length was not reduced when equal
+ je .LFINISH
+
+ sub rsi,rax ; for prefetching the last round, load the last round again
+	sub	rdx,rax			; adapt end condition for last round also
+ xor rax,rax
+ jmp .LHEXDECODE_LOOP
+
+.LFINISH:
+
+ mov rax,rdi
+ sub rax,r11 ; rax = number of output bytes
+ add rax,rax ; rax = number of valid processed input bytes = return value
+
+%ifdef __WIN__
+
+ VMOVDQA xmm6 ,[rsp ]
+ VMOVDQA xmm7 ,[rsp+1*16]
+ VMOVDQA xmm8 ,[rsp+2*16]
+ VMOVDQA xmm9 ,[rsp+3*16]
+ VMOVDQA xmm10,[rsp+4*16]
+ VMOVDQA xmm11,[rsp+5*16]
+ VMOVDQA xmm12,[rsp+6*16]
+ VMOVDQA xmm13,[rsp+7*16]
+ VMOVDQA xmm14,[rsp+8*16]
+ VMOVDQA xmm15,[rsp+9*16]
+
+%endif
+
+ mov rdi,[rsp+STACK_FOR_XMM+0*8]
+ mov rsi,[rsp+STACK_FOR_XMM+1*8]
+ mov r12,[rsp+STACK_FOR_XMM+2*8]
+ mov r14,[rsp+STACK_FOR_XMM+3*8]
+ mov r15,[rsp+STACK_FOR_XMM+4*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+
+;----------------------------------------------------------------------------------------------
+
+; k7 ; compare flags lower eq little f, initially all ONE
+; k6 ; compare flags greater eq little a, initially all ONE QL0,QL1
+; k5 ; CONST BITMASK_ZERO_ONE 0101010101010101 selecting lower half
+; k4 ; digit flags QL3
+; k3 ; digit flags QL2
+; k2 ; digit flags QL1
+; k1 ; digit flags QL0
+; k0 ;
+
+; zmm31 ; CONST ALL ZERO
+; zmm30 ; CONST BITMASK_LOWER_HALF
+; zmm29 ; CONST ASCII_0_OFFSET
+; zmm28 ; CONST ASCII_LITTLE_A_ADD
+; zmm27 ; CONST VPERM_DECODE_OFFSETS
+; zmm26 ; CONST ALL bytes 9
+; zmm25	; Output Line OH1 (Line 0 is stored in the source load regs zmm8-zmm23)
+; zmm24 ; Output Line OL1
+; zmm23 ; Preload QL3
+; zmm22 ; Preload QL2
+; zmm21 ; Preload QL1
+; zmm20 ; Preload QL0
+; zmm19 ; Source Load QL3
+; zmm18 ; Source Load QL2
+; zmm17 ; Source Load QL1
+; zmm16 ; Source Load QL0
+
+; zmm15 ; QL3 little a
+; zmm14 ; QL2 little a
+; zmm13 ; QL1 little a
+; zmm12 ; QL0 little a
+; zmm11 ; QL3 masked for digit
+; zmm10 ; QL2 masked for digit
+; zmm9 ; QL1 masked for digit
+; zmm8 ; QL0 masked for digit
+; zmm7 ; lower nibble masked QL3
+; zmm6 ; lower nibble masked QL2
+; zmm5 ; lower nibble masked QL1
+; zmm4 ; lower nibble masked QL0
+; zmm3 ;
+; zmm2 ;
+; zmm1 ;
+; zmm0 ;
+
+
+%define NINP_BYTES_PER_ROUND 4*64
+%define NINP_BITSHIFT 8
+
+hex_decode_avx512bw:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+STACK_FOR_XMM+0*8],rdi
+ mov [rsp+STACK_FOR_XMM+1*8],rsi
+ mov [rsp+STACK_FOR_XMM+2*8],r12
+ mov [rsp+STACK_FOR_XMM+3*8],r14
+ mov [rsp+STACK_FOR_XMM+4*8],r15
+
+%ifdef __WIN__
+
+ VMOVDQA [rsp ],xmm6
+ VMOVDQA [rsp+1*16],xmm7
+ VMOVDQA [rsp+2*16],xmm8
+ VMOVDQA [rsp+3*16],xmm9
+ VMOVDQA [rsp+4*16],xmm10
+ VMOVDQA [rsp+5*16],xmm11
+ VMOVDQA [rsp+6*16],xmm12
+ VMOVDQA [rsp+7*16],xmm13
+ VMOVDQA [rsp+8*16],xmm14
+ VMOVDQA [rsp+9*16],xmm15
+
+ mov rdi,rcx ; parameter 1 output buffer
+
+ mov rsi,rdx ; parameter 2 input buffer
+
+ mov rdx,r8 ; parameter 3 number of elements
+
+%endif
+
+;; initializer for QQ0 and QQ1
+
+ VMOVDQU64 zmm20,[rsi+0*64] ; QQ0 p____5 l3+ QL4
+ VMOVDQU64 zmm21,[rsi+1*64] ; QQ0 p____5 l3+ QL5
+ VMOVDQU64 zmm22,[rsi+2*64] ; QQ0 p____5 l3+ QL6
+ VMOVDQU64 zmm23,[rsi+3*64] ; QQ0 p____5 l3+ QL7
+
+;; initialize constants
+
+ KXNORQ k7,k7,k7 ; all one
+
+ VPBROADCASTQ zmm31,[ALL_BYTES_9] ; p_23__ l3
+
+ VPBROADCASTQ zmm30,[BITMASK_LOWER_HALF] ; p_23__ l3
+
+ KXNORQ k6,k6,k6 ; all one
+
+ VPBROADCASTQ zmm29,[ASCII_LETTER_LITTLE_F] ; p_23__ l3
+
+ VPBROADCASTQ zmm28,[ASCII_LETTER_LITTLE_A] ; p_23__ l3
+
+ KMOVQ k5,[BITMASK_ZERO_ONE]
+
+ VMOVDQA64 zmm27,[VPERM_DECODE_OFFSETS] ; p_23__ l3
+
+ VPBROADCASTQ zmm26,[BITMASK_BIG_TO_LITTLE_ASCII]; p_23__ l3
+
+ VPBROADCASTQ zmm25,[BITMASK_SELECT_DIGIT] ; p_23__ l3
+
+
+;; do page overshoot checks
+;; due to end condition handling not done here, we only process full rounds
+
+ mov rax,NINP_BYTES_PER_ROUND
+
+ add rdx,NINP_BYTES_PER_ROUND-1
+ shr rdx,NINP_BITSHIFT ;
+ shl rdx,NINP_BITSHIFT ; rdx number of bytes read in normal loop equiv to xxx full loops
+
+ mov r9,rdx ; exact requested number of elements to process
+ add r9,rsi ; r9 last valid pointer +1 of requested input buffer
+
+ mov r10,rsi ; r10 saved start of input buffer
+ mov r12,r9 ; r12 save of end of input buffer+1
+
+ lea rcx,[rsi+rdx-1] ; rcx address of last byte requested to read
+
+ mov r11,r9
+; DISABLED for NO OVERSHOOT
+; add r11,rax ; r11 address of last byte of prefetched data
+
+ shr rcx,NSHIFT_ADDRESS_TO_PAGE ; rcx page number of last byte requested input
+ shr r11,NSHIFT_ADDRESS_TO_PAGE ; r11 page number of byte after prefetched data
+ cmp rcx,r11
+ je .LSAME_PAGE_IN_PREFETCH
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+
+.LSAME_PAGE_IN_PREFETCH:
+ add rdx,rsi ; rdx last valid pointer+1 for normal loop
+
+ ; due to prefetch add one round to end checks
+ add rdx,rax
+ add r9,rax
+
+ mov r11,rdi ; r11 saved start of output buffer
+
+ mov rcx,NINP_BYTES_PER_ROUND>>1 ; increment of output buffer for each round
+
+;; start preprocessing before loop
+
+; VPUNPCKHBW zmm1,zmm16,zmm31 ; QL0 p____5 l1 QQ0 [Lin0_LeftH] [00 HL_0 00 HL_1 ...]
+; VPUNPCKLBW zmm3,zmm16,zmm31 ; QL0 p____5 l1 QQ0 [Lin0_RghtH] [00 HL_0 00 HL_1 ...]
+
+ add rsi,rax ; add the number of processed array elements
+
+ align 32
+
+; ;IACA START_MARKER
+; mov ebx, 111
+; db 0x64, 0x67, 0x90
+
+
+; Vector Port info AVX512
+; ----------------------------------------
+; VPShift p0 l1
+; VPMax/Min p0 l1
+; VPMUL p0 l5 ; with 2FMA-Units p05 (SKX,CLX etc.)
+; VPMOVB2M p0 l3
+; VPSUBUSB /SSB p0 l1
+
+; VPALIGNR p5 l1 ;Shift of n*8 bits!
+; VPERM p5 l3
+; VPERMI2x 1*p05+2*p5 l7 ; (l9 with flags)
+; VPCompare p5 l3-l4
+; VP Pack/Unpack p5 l1(SKX) l3(TGL)
+; VPSHUF p5 l1
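+
+; The pN / lN tags in the instruction comments apparently follow the same
+; notation as this table: pN = execution port(s) the instruction can issue to,
+; lN = its latency in cycles (annotation only, no effect on the generated code).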
+
+
+.LHEXDECODE_LOOP:
+
+ VMOVDQA64 zmm16,zmm20 ; QL0 copy preload to load
+ VMOVDQA64 zmm17,zmm21 ; QL1 copy preload to load
+ VPANDQ zmm8,zmm25,zmm20 ; QL0 set bitmask for digits only
+ VPMAXUB zmm0,zmm20,zmm21 ; QL0,QL1 max from both lines
+;;; VPCMPB k7{k7},zmm29,zmm20,2 ; QL0 compare lower_eq little f
+
+ VMOVDQA64 zmm18,zmm22 ; QL2
+ VMOVDQA64 zmm19,zmm23 ; QL3
+ VPCMPEQB k1,zmm8,zmm20 ; QL0 compare for is digit
+ VPANDQ zmm9,zmm25,zmm21 ; QL1 set bitmask for digits only
+
+ VMOVDQU64 zmm20,[rsi+0*64] ; QQ0 p____5 l3+ QL4
+ VMOVDQU64 zmm21,[rsi+1*64] ; QL1 p____5 l3+ QL5
+
+ VPANDQ zmm10,zmm25,zmm18 ; QL2 set bitmask for digits only
+ VPCMPEQB k2,zmm9,zmm17 ; QL1 compare for is digit
+
+ VMOVDQU64 zmm22,[rsi+2*64] ; QQ0 p____5 l3+ QL6
+ VMOVDQU64 zmm23,[rsi+3*64] ; QQ0 p____5 l3+ QL7
+
+ VPANDQ zmm11,zmm25,zmm19 ; QL3 set bitmask for digits only
+ VPCMPEQB k3,zmm10,zmm18 ; QL2 compare for is digit
+
+ VPMAXUB zmm1,zmm18,zmm19 ; QL2,QL3 max from both lines
+;;; VPCMPB k7{k7},zmm29,zmm17,2 ; QL1 compare lower_eq little f
+	VPCMPEQB	k4,zmm11,zmm19		; QL3 compare for is digit
+
+ add rsi,rax ; add the number of processed array elements
+
+ VPORQ zmm12,zmm26,zmm16 ; QL0 set bit for little a
+ VPANDQ zmm4,zmm30,zmm16 ; QL0 bitmask lower nibble
+
+ VPORQ zmm13,zmm26,zmm17 ; QL1 set bit for little a
+ VPANDQ zmm5,zmm30,zmm17 ; QL1 bitmask lower nibble
+
+ VPMAXUB zmm0,zmm0,zmm1 ; QL0,QL1,QL2,QL3 max from 4 lines
+ VPADDB zmm4,zmm4,zmm31 ; QL0 add 9
+
+ VPORQ zmm14,zmm26,zmm18 ; QL2 set bit for little a
+ VPANDQ zmm6,zmm30,zmm18 ; QL2 bitmask lower nibble
+
+ VPANDQ zmm7,zmm30,zmm19 ; QL3 bitmask lower nibble
+ VPCMPB k7{k7},zmm29,zmm0,2 ; QL0,QL1,QL2,QL3 compare lower_eq little f
+
+ VPADDB zmm5,zmm5,zmm31 ; QL1 add 9
+ VPORQ zmm15,zmm26,zmm19 ; QL3 set bit for little a
+
+ VPADDB zmm6,zmm6,zmm31 ; QL2 add 9
+ VPADDB zmm7,zmm7,zmm31 ; QL3 add 9
+
+ VPSUBB zmm4{k1},zmm4,zmm31 ; QL0 sub 9 for digits
+ VPSUBB zmm5{k2},zmm5,zmm31 ; QL1 sub 9 for digits
+ VPSUBB zmm6{k3},zmm6,zmm31 ; QL2 sub 9 for digits
+ VPSUBB zmm7{k4},zmm7,zmm31 ; QL3 sub 9 for digits
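+
+	; k1-k4 hold the per-byte "is ASCII digit" flags from the VPCMPEQB above:
+	; every masked nibble first gets +9 added unconditionally and the masked
+	; VPSUBB takes the 9 back for digits only, so digits keep their value and
+	; letters end up at (low nibble)+9. k5 (0101... pattern) is used below to
+	; keep only the selected byte of each word when the nibble values are merged.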
+
+;
+
+ VPSRLW zmm0,zmm4,8 ; QL0 lower nibble-value
+ VPSLLW zmm4,zmm4,2 ; QL0 upper nibble_value
+
+ VPADDB zmm0{k5}{z},zmm0,zmm4 ; QL0 values in lower byte of dword
+
+ VPSRLW zmm1,zmm5,8 ; QL1 lower nibble-value
+ VPSLLW zmm5,zmm5,2 ; QL1 upper nibble_value
+
+ VPADDB zmm1{k5}{z},zmm1,zmm5 ; QL1 values in lower byte of dword
+
+	VPACKUSWB	zmm0,zmm0,zmm1		; QL0,QL1 values in single bytes
+
+ VMOVDQA64 [rdi+0*64],zmm0 ;DEBUG ########### ; QL0 p____4_ p____49 l4 [Lin0 Left Half] Store Hexdump
+
+ VPERMQ zmm1,zmm27,zmm0 ; QL0,QL1 byte values in right order
+
+;
+
+ VPSRLW zmm2,zmm6,8 ; QL2 lower nibble-value
+ VPSLLW zmm6,zmm6,2 ; QL2 upper nibble_value
+
+ VPADDB zmm2{k5}{z},zmm2,zmm6 ; QL2 values in lower byte of dword
+
+ VPSRLW zmm3,zmm7,8 ; QL3 lower nibble-value
+ VPSLLW zmm7,zmm7,2 ; QL3 upper nibble_value
+
+ VPADDB zmm3{k5}{z},zmm3,zmm7 ; QL3 values in lower byte of dword
+
+	VPACKUSWB	zmm2,zmm2,zmm3		; QL2,QL3 values in single bytes
+
+ VMOVDQA64 [rdi+1*64],zmm2 ;DEBUG ############## ; QL0 p____4_ p____49 l4 [Lin0 Rght Half] Store Hexdump
+
+ VPERMQ zmm3,zmm27,zmm2 ; QL2,QL3 byte values in right order
+
+; -------- MISSING check for little a
+
+	VMOVDQA64	[rdi+0*64],zmm1		; QL0 p____4_ p____49 l4 [Lin0 Left Half] Store decoded bytes
+	VMOVDQA64	[rdi+1*64],zmm3		; QL0 p____4_ p____49 l4 [Lin0 Rght Half] Store decoded bytes
+
+ add rdi,rcx ; add the number of processed output bytes
+
+
+ cmp rsi,rdx ; compare incremented rsi (= pointer to next chunk) to last valid pointer +1
+ jl .LHEXDECODE_LOOP
+
+ ; end of normal loop reached
+ ; we can do one more round when original count has been reduced by one round
+ cmp rax,0
+ je .LFINISH
+
+ cmp rdx,r9 ; input buffer length was not reduced when equal
+ je .LFINISH
+
+ sub rsi,rax ; for prefetching the last round, load the last round again
+	sub	rdx,rax			; adapt end condition for last round also
+ xor rax,rax
+ jmp .LHEXDECODE_LOOP
+
+.LFINISH:
+
+ mov rax,rdi
+ sub rax,r11 ; rax = number of output bytes
+ add rax,rax ; rax = number of valid processed input bytes = return value
+
+%ifdef __WIN__
+
+ VMOVDQA xmm6 ,[rsp ]
+ VMOVDQA xmm7 ,[rsp+1*16]
+ VMOVDQA xmm8 ,[rsp+2*16]
+ VMOVDQA xmm9 ,[rsp+3*16]
+ VMOVDQA xmm10,[rsp+4*16]
+ VMOVDQA xmm11,[rsp+5*16]
+ VMOVDQA xmm12,[rsp+6*16]
+ VMOVDQA xmm13,[rsp+7*16]
+ VMOVDQA xmm14,[rsp+8*16]
+ VMOVDQA xmm15,[rsp+9*16]
+
+%endif
+
+ mov rdi,[rsp+STACK_FOR_XMM+0*8]
+ mov rsi,[rsp+STACK_FOR_XMM+1*8]
+ mov r12,[rsp+STACK_FOR_XMM+2*8]
+ mov r14,[rsp+STACK_FOR_XMM+3*8]
+ mov r15,[rsp+STACK_FOR_XMM+4*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+
+;----------------------------------------------------------------------------------------------
+
+
+; ymm15 ; Source Load QL7
+; ymm14 ; Source Load QL6
+; ymm13 ; Source Load QL5
+; ymm12 ; Source Load QL4
+; ymm11 ; Source Load QL3
+; ymm10 ; Source Load QL2
+; ymm9 ; Source Load QL1
+; ymm8 ; Source Load QL0
+
+; ymm7 ; CONST ENCODE_SHUFFLE_TO_HEX
+; ymm6 ; CONST BITMASK_NIBBLE_3_IN_WORD
+; ymm5 ; Shift temp for High nibble 1
+; ymm4 ; Shift temp for High nibble 0
+; ymm3 ; Temp3
+; ymm2 ; Temp2
+; ymm1 ; Temp1
+; ymm0 ; Temp0
+
+
+%define NINP_BYTES_PER_ROUND 8*32
+%define NINP_BITSHIFT 8
+
+hex_encode_avx2:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+STACK_FOR_XMM+0*8],rdi
+ mov [rsp+STACK_FOR_XMM+1*8],rsi
+ mov [rsp+STACK_FOR_XMM+2*8],r12
+ mov [rsp+STACK_FOR_XMM+3*8],r14
+ mov [rsp+STACK_FOR_XMM+4*8],r15
+
+%ifdef __WIN__
+
+ VMOVDQA [rsp ],xmm6
+ VMOVDQA [rsp+1*16],xmm7
+ VMOVDQA [rsp+2*16],xmm8
+ VMOVDQA [rsp+3*16],xmm9
+ VMOVDQA [rsp+4*16],xmm10
+ VMOVDQA [rsp+5*16],xmm11
+ VMOVDQA [rsp+6*16],xmm12
+ VMOVDQA [rsp+7*16],xmm13
+ VMOVDQA [rsp+8*16],xmm14
+ VMOVDQA [rsp+9*16],xmm15
+
+ mov rdi,rcx ; parameter 1 output buffer
+
+ mov rsi,rdx ; parameter 2 input buffer
+
+ mov rdx,r8 ; parameter 3 number of elements
+
+%endif
+
+;; Loading QL0-QL3, prefetching QL4-QL7
+
+ VPERMQ ymm8, [rsi+0*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL0
+ VPERMQ ymm9, [rsi+1*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL1
+ VPERMQ ymm10,[rsi+2*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL2
+ VPERMQ ymm11,[rsi+3*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL3
+
+ VPERMQ ymm12,[rsi+4*32],VPERM_AVX2_OFFS ; BBB p_____5 p1____5 l3+ QL4
+ VPERMQ ymm13,[rsi+5*32],VPERM_AVX2_OFFS ; BBB p_____5 p1____5 l3+ QL5
+ VPERMQ ymm14,[rsi+6*32],VPERM_AVX2_OFFS ; BBB p_____5 p1____5 l3+ QL6
+ VPERMQ ymm15,[rsi+7*32],VPERM_AVX2_OFFS ; BBB p_____5 p1____5 l3+ QL7
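+
+; The loads go through VPERMQ with VPERM_AVX2_OFFS instead of plain VMOVDQU,
+; presumably to pre-arrange the 64-bit lanes for the 128-bit-lane-local
+; VPUNPCKLBW/VPUNPCKHBW used in the loop, so that the hex output can still be
+; stored in linear order.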
+
+;; initialize constants
+
+ VMOVDQA ymm7,[ENCODE_SHUFFLE_TO_HEX] ; p_23__ l3
+
+ VMOVDQA ymm6,[BITMASK_LOWER_HALF] ; p_23__ l3
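+
+; The nibble-to-character conversion in this variant is a VPSHUFB table lookup:
+; each masked nibble (0..15) indexes into ENCODE_SHUFFLE_TO_HEX, which is
+; assumed to hold "0123456789abcdef" in every 16-byte lane.
+; Scalar sketch for one input byte b:
+;     out[2*i]   = "0123456789abcdef"[(b >> 4) & 0x0F];
+;     out[2*i+1] = "0123456789abcdef"[ b       & 0x0F];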
+
+;; do page overshoot checks
+
+ mov rax,NINP_BYTES_PER_ROUND
+
+
+ mov r9,rdx ; exact requested number of elements to process
+ add r9,rsi ; r9 last valid pointer +1 of requested input buffer
+
+ mov r10,rsi ; r10 saved start of input buffer
+ mov r12,r9 ; r12 save of end of input buffer+1
+
+ lea rcx,[rsi+rdx-1] ; rcx address of last byte requested to read
+
+ lea r8,[rdx+NINP_BYTES_PER_ROUND-1]
+ shr r8,NINP_BITSHIFT ; number of loops
+ shl r8,NINP_BITSHIFT
+ add r8,rsi ; r8 address of last byte+1 read in complete loops
+ add r8,NINP_BYTES_PER_ROUND-1 ; r8 address of last byte read in normal loop with overshoot
+
+ mov r11,r8
+
+; DISABLED for NO OVERSHOOT
+; add r11,rax ; r11 address of last byte of prefetched data
+
+ shr rcx,NSHIFT_ADDRESS_TO_PAGE ; rcx page number of last byte requested input
+ shr r8,NSHIFT_ADDRESS_TO_PAGE ; r8 page number of last byte read after normal round
+
+ cmp rcx,r8 ; stay on same page
+ je .LSAME_PAGE_IN_ROUND
+ sub rdx,rax ; don't overshoot in reading: do one round less
+
+.LSAME_PAGE_IN_ROUND:
+ shr r11,NSHIFT_ADDRESS_TO_PAGE ; r11 page number of byte after prefetched data
+ cmp rcx,r11
+ je .LSAME_PAGE_IN_PREFETCH
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+
+.LSAME_PAGE_IN_PREFETCH:
+ add rdx,rsi ; rdx last valid pointer+1 for normal loop
+
+ ; due to prefetch add one round to end checks
+ add rdx,rax
+ add r9,rax
+
+ mov r11,rdi ; r11 saved start of output buffer
+
+ mov rcx,NINP_BYTES_PER_ROUND<<1 ; increment of output buffer for each round
+
+;; start preprocessing before loop
+
+ VPSRLQ ymm4,ymm8,4 ; AAA RL00,RL01 QL0 shift Hx to lower nibble in byte
+ VPSRLQ ymm5,ymm9,4 ; AAA RL02,RL03 QL1 shift Hx to lower nibble in byte
+
+ VPUNPCKLBW ymm0,ymm4,ymm8 ; AAA RL00 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPUNPCKHBW ymm1,ymm4,ymm8 ; AAA RL01 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ add rsi,rax ; add the number of processed array elements
+
+
+ align 32
+
+; ;IACA START_MARKER
+; mov ebx, 111
+; db 0x64, 0x67, 0x90
+
+
+.LHEXENCODE_LOOP:
+
+;; process unpacked AAA in YMM0-YMM4 and YMM8-YMM11, UNPCK BBB to YMM0-YMM1, PreLoad AAA to YMM8-YMM11
+;;RL00,RL01,RL02,RL03,RL04,RL05,RL06,RL07
+ VPUNPCKLBW ymm2,ymm5,ymm9 ; AAA RL02 p____5 l1 QQ0 [Lin0_RghtH] [... R06_HL1 R04_HL1 R06_HL0 R04_HL0]
+ VPSRLQ ymm4,ymm10,4 ; AAA RL04,RL05 QL2 shift Hx to lower nibble in byte
+ VPUNPCKHBW ymm3,ymm5,ymm9 ; AAA RL03 p____5 l1 QQ0 [Lin0_LeftH] [... R07_HL1 R05_HL1 R07_HL0 R05_HL0]
+ VPAND ymm0,ymm0,ymm6 ; AAA RL00 mask lower nibble
+
+ VPSRLQ ymm5,ymm11,4 ; AAA RL06,RL07 QL3 shift Hx to lower nibble in byte
+ VPUNPCKLBW ymm8,ymm4,ymm10 ; AAA RL04 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPAND ymm1,ymm1,ymm6 ; AAA RL01 mask lower nibble
+ VPUNPCKHBW ymm9,ymm4,ymm10 ; AAA RL05 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPAND ymm2,ymm2,ymm6 ; AAA RL02 mask lower nibble
+ VPUNPCKLBW ymm10,ymm5,ymm11 ; AAA RL06 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPAND ymm3,ymm3,ymm6 ; AAA RL03 mask lower nibble
+ VPUNPCKHBW ymm11,ymm5,ymm11 ; AAA RL07 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPSHUFB ymm0,ymm7,ymm0 ; AAA RL00 shuffle_to_hex_digits
+ VPAND ymm8,ymm8,ymm6 ; AAA RL04 mask lower nibble
+ VPSHUFB ymm1,ymm7,ymm1 ; AAA RL01 shuffle_to_hex_digits
+ VPAND ymm9,ymm9,ymm6 ; AAA RL05 mask lower nibble
+
+ VPSHUFB ymm2,ymm7,ymm2 ; AAA RL02 shuffle_to_hex_digits
+ VMOVDQU [rdi+0*32],ymm0 ; AAA RL00 Store Hexdump
+ VPAND ymm10,ymm10,ymm6 ; AAA RL06 mask lower nibble
+
+ VPSHUFB ymm3,ymm7,ymm3 ; AAA RL03 shuffle_to_hex_digits
+ VMOVDQU [rdi+1*32],ymm1 ; AAA RL01 Store Hexdump
+ VPAND ymm11,ymm11,ymm6 ; AAA RL07 mask lower nibble
+
+ VPSHUFB ymm8,ymm7,ymm8 ; AAA RL04 shuffle_to_hex_digits
+ VPSRLQ ymm4,ymm12,4 ; BBB RL08,RL09 QL4 shift Hx to lower nibble in byte
+ VMOVDQU [rdi+2*32],ymm2 ; AAA RL02 Store Hexdump
+ VPSHUFB ymm9,ymm7,ymm9 ; AAA RL05 shuffle_to_hex_digits
+ VPSRLQ ymm5,ymm13,4 ; BBB RL10,RL11 QL5 shift Hx to lower nibble in byte
+ VMOVDQU [rdi+3*32],ymm3 ; AAA RL03 Store Hexdump
+
+ VPSHUFB ymm10,ymm7,ymm10 ; AAA RL06 shuffle_to_hex_digits
+ VMOVDQU [rdi+4*32],ymm8 ; AAA RL04 Store Hexdump
+ VPERMQ ymm8, [rsi+0*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL0
+ VMOVDQU [rdi+5*32],ymm9 ; AAA RL05 Store Hexdump
+ VPERMQ ymm9, [rsi+1*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL1
+
+ VPSHUFB ymm11,ymm7,ymm11 ; AAA RL07 shuffle_to_hex_digits
+ VMOVDQU [rdi+6*32],ymm10 ; AAA RL06 Store Hexdump
+ VPUNPCKLBW ymm0,ymm4,ymm12 ; BBB RL08 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPERMQ ymm10,[rsi+2*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL2
+
+ VMOVDQU [rdi+7*32],ymm11 ; AAA RL07 Store Hexdump
+ VPERMQ ymm11,[rsi+3*32],VPERM_AVX2_OFFS ; AAA p_____5 p1____5 l3+ QL3
+	VPUNPCKHBW	ymm1,ymm4,ymm12		; BBB RL09 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+
+;; process unpacked BBB in YMM0-YMM4 and YMM9-YMM15, UNPCK AAA to YMM0-YMM1, PreLoad BBB to YMM12-YMM15
+;;RL08,RL09,RL10,RL11,RL12,RL13,RL14,RL15
+ VPUNPCKLBW ymm2,ymm5,ymm13 ; BBB RL10 p____5 l1 QQ0 [Lin0_RghtH] [... R06_HL1 R04_HL1 R06_HL0 R04_HL0]
+ VPSRLQ ymm4,ymm14,4 ; BBB RL12,RL13 QL6 shift Hx to lower nibble in byte
+ VPUNPCKHBW ymm3,ymm5,ymm13 ; BBB RL11 p____5 l1 QQ0 [Lin0_LeftH] [... R07_HL1 R05_HL1 R07_HL0 R05_HL0]
+ VPAND ymm0,ymm0,ymm6 ; BBB RL08 mask lower nibble
+
+ VPSRLQ ymm5,ymm15,4 ; BBB RL14,RL15 QL7 shift Hx to lower nibble in byte
+ VPUNPCKLBW ymm12,ymm4,ymm14 ; BBB RL12 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPAND ymm1,ymm1,ymm6 ; BBB RL09 mask lower nibble
+ VPUNPCKHBW ymm13,ymm4,ymm14 ; BBB RL13 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPAND ymm2,ymm2,ymm6 ; BBB RL10 mask lower nibble
+ VPUNPCKLBW ymm14,ymm5,ymm15 ; BBB RL14 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPAND ymm3,ymm3,ymm6 ; BBB RL11 mask lower nibble
+ VPUNPCKHBW ymm15,ymm5,ymm15 ; BBB RL15 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPSHUFB ymm0,ymm7,ymm0 ; BBB RL08 shuffle_to_hex_digits
+ VPAND ymm12,ymm12,ymm6 ; BBB RL12 mask lower nibble
+ VPSHUFB ymm1,ymm7,ymm1 ; BBB RL09 shuffle_to_hex_digits
+ VPAND ymm13,ymm13,ymm6 ; BBB RL13 mask lower nibble
+
+ VPSHUFB ymm2,ymm7,ymm2 ; BBB RL10 shuffle_to_hex_digits
+ VMOVDQU [rdi+8*32],ymm0 ; BBB RL08 Store Hexdump
+ VPAND ymm14,ymm14,ymm6 ; BBB RL14 mask lower nibble
+
+ VPSHUFB ymm3,ymm7,ymm3 ; BBB RL11 shuffle_to_hex_digits
+ VMOVDQU [rdi+9*32],ymm1 ; BBB RL09 Store Hexdump
+ VPAND ymm15,ymm15,ymm6 ; BBB RL15 mask lower nibble
+
+ VPSHUFB ymm12,ymm7,ymm12 ; BBB RL12 shuffle_to_hex_digits
+ VPSRLQ ymm4,ymm8,4 ; AAA RL00,RL01 QL0 shift Hx to lower nibble in byte
+ VMOVDQU [rdi+10*32],ymm2 ; BBB RL10 Store Hexdump
+ VPSHUFB ymm13,ymm7,ymm13 ; BBB RL13 shuffle_to_hex_digits
+ VPSRLQ ymm5,ymm9,4 ; AAA RL02,RL03 QL1 shift Hx to lower nibble in byte
+ VMOVDQU [rdi+11*32],ymm3 ; BBB RL11 Store Hexdump
+
+ VPSHUFB ymm14,ymm7,ymm14 ; BBB RL14 shuffle_to_hex_digits
+ VMOVDQU [rdi+12*32],ymm12 ; BBB RL12 Store Hexdump
+	VPERMQ	ymm12, [rsi+4*32],VPERM_AVX2_OFFS	; BBB p_____5 p1____5 l3+ QL4
+ VMOVDQU [rdi+13*32],ymm13 ; BBB RL13 Store Hexdump
+	VPERMQ	ymm13, [rsi+5*32],VPERM_AVX2_OFFS	; BBB p_____5 p1____5 l3+ QL5
+
+ VPSHUFB ymm15,ymm7,ymm15 ; BBB RL15 shuffle_to_hex_digits
+ VMOVDQU [rdi+14*32],ymm14 ; BBB RL14 Store Hexdump
+ VPUNPCKLBW ymm0,ymm4,ymm8 ; AAA RL00 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+	VPERMQ	ymm14,[rsi+6*32],VPERM_AVX2_OFFS	; BBB p_____5 p1____5 l3+ QL6
+
+ VMOVDQU [rdi+15*32],ymm15 ; BBB RL15 Store Hexdump
+	VPERMQ	ymm15,[rsi+7*32],VPERM_AVX2_OFFS	; BBB p_____5 p1____5 l3+ QL7
+
+ add rsi,rax ; add the number of processed array elements
+
+ VPUNPCKHBW ymm1,ymm4,ymm8 ; AAA RL01 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ add rdi,rcx ; add the number of processed output bytes
+
+
+ cmp rsi,rdx ; compare incremented rsi (= pointer to next chunk) to last valid pointer +1
+ jl .LHEXENCODE_LOOP
+
+ ; end of normal loop reached
+ ; we can do one more round when original count has been reduced by one round
+ cmp rax,0
+ je .LFINISH_EXTRA
+
+ cmp rdx,r9 ; input buffer length was not reduced when equal
+ je .LFINISH_NORMAL
+
+ sub rsi,rax ; for prefetching the last round, load the last round again
+	sub	rdx,rax			; adapt end condition for last round also
+ xor rax,rax
+ jmp .LHEXENCODE_LOOP
+
+
+.LFINISH_EXTRA:
+ add rsi,NINP_BYTES_PER_ROUND ; add the extra round to get processed bytes
+ jmp .LFINISH
+
+.LFINISH_NORMAL:
+ sub rsi,NINP_BYTES_PER_ROUND ; sub the added prefetch round to get processed bytes
+
+.LFINISH:
+ ; r9 = address of requested input bytes+1
+ ; rsi = address of processed input bytes+1
+ ; now get the minimum of rdx,rsi to rax
+;; sub rsi,rax ; for last round do nothing (rax=0), else sub increment for one round
+;; sub r9,rax
+
+ mov rax,r12
+ cmp rsi,r12 ; get min from r12 (address of requested input) and rsi (address of done input)
+
+ jge .LCALC_PROCESSED_BYTES
+ mov rax,rsi ; rax=address of last valid input byte+1
+
+.LCALC_PROCESSED_BYTES:
+ sub rax,r10 ; sub the input buffer start address
+ ; rax = number of valid processed input bytes = return value
+
+ cmp rsi,rdx ; compare rdx (address of requested input) and rsi (address of done input)
+ je .LNO_ZERO_OUT
+
+ mov r15,rax ; number of elements to process
+
+ shl r15,1 ; number of output bytes
+
+ add r15,r11 ; pointer to next byte after full valid output buffer
+
+
+ VPXOR ymm0,ymm0,ymm0 ; all zero
+;ZERO VMOVDQU [r15],ymm0 ; zero out one register width after last output
+
+.LNO_ZERO_OUT:
+
+%ifdef __WIN__
+
+ VMOVDQA xmm6 ,[rsp ]
+ VMOVDQA xmm7 ,[rsp+1*16]
+ VMOVDQA xmm8 ,[rsp+2*16]
+ VMOVDQA xmm9 ,[rsp+3*16]
+ VMOVDQA xmm10,[rsp+4*16]
+ VMOVDQA xmm11,[rsp+5*16]
+ VMOVDQA xmm12,[rsp+6*16]
+ VMOVDQA xmm13,[rsp+7*16]
+ VMOVDQA xmm14,[rsp+8*16]
+ VMOVDQA xmm15,[rsp+9*16]
+
+%endif
+
+ mov rdi,[rsp+STACK_FOR_XMM+0*8]
+ mov rsi,[rsp+STACK_FOR_XMM+1*8]
+ mov r12,[rsp+STACK_FOR_XMM+2*8]
+ mov r14,[rsp+STACK_FOR_XMM+3*8]
+ mov r15,[rsp+STACK_FOR_XMM+4*8]
+
+ add rsp,STACK_ADJ
+
+
+ ret
+
+;----------------------------------------------------------------------------------------------
+
+
+; xmm15 ; Source Load QL7
+; xmm14 ; Source Load QL6
+; xmm13 ; Source Load QL5
+; xmm12 ; Source Load QL4
+; xmm11 ; Source Load QL3
+; xmm10 ; Source Load QL2
+; xmm9 ; Source Load QL1
+; xmm8 ; Source Load QL0
+
+; xmm7 ; CONST ENCODE_SHUFFLE_TO_HEX
+; xmm6 ; CONST BITMASK_NIBBLE_3_IN_WORD
+; xmm5 ; Shift temp for High nibble 1
+; xmm4 ; Shift temp for High nibble 0
+; xmm3 ; Temp3
+; xmm2 ; Temp2
+; xmm1 ; Temp1
+; xmm0 ; Temp0
+
+
+
+%define NINP_BYTES_PER_ROUND 8*16
+%define NINP_BITSHIFT 7
+
+
+hex_encode_ssse3:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+STACK_FOR_XMM+0*8],rdi
+ mov [rsp+STACK_FOR_XMM+1*8],rsi
+ mov [rsp+STACK_FOR_XMM+2*8],r12
+ mov [rsp+STACK_FOR_XMM+3*8],r14
+ mov [rsp+STACK_FOR_XMM+4*8],r15
+
+%ifdef __WIN__
+
+ MOVDQA [rsp ],xmm6
+ MOVDQA [rsp+1*16],xmm7
+ MOVDQA [rsp+2*16],xmm8
+ MOVDQA [rsp+3*16],xmm9
+ MOVDQA [rsp+4*16],xmm10
+ MOVDQA [rsp+5*16],xmm11
+ MOVDQA [rsp+6*16],xmm12
+ MOVDQA [rsp+7*16],xmm13
+ MOVDQA [rsp+8*16],xmm14
+ MOVDQA [rsp+9*16],xmm15
+
+ mov rdi,rcx ; parameter 1 output buffer
+
+ mov rsi,rdx ; parameter 2 input buffer
+
+ mov rdx,r8 ; parameter 3 number of elements
+
+%endif
+
+;; Loading QL0-QL3, prefetching QL4-QL7
+
+ MOVDQU xmm8, [rsi+0*16] ; AAA p_____5 p1____5 l3+ QL0
+ MOVDQU xmm9, [rsi+1*16] ; AAA p_____5 p1____5 l3+ QL1
+ MOVDQU xmm10,[rsi+2*16] ; AAA p_____5 p1____5 l3+ QL2
+ MOVDQU xmm11,[rsi+3*16] ; AAA p_____5 p1____5 l3+ QL3
+
+ MOVDQU xmm12,[rsi+4*16] ; BBB p_____5 p1____5 l3+ QL4
+ MOVDQU xmm13,[rsi+5*16] ; BBB p_____5 p1____5 l3+ QL5
+ MOVDQU xmm14,[rsi+6*16] ; BBB p_____5 p1____5 l3+ QL6
+ MOVDQU xmm15,[rsi+7*16] ; BBB p_____5 p1____5 l3+ QL7
+
+;; initialize constants
+
+ MOVDQA xmm7,[ENCODE_SHUFFLE_TO_HEX] ; p_23__ l3
+
+ MOVDQA xmm6,[BITMASK_LOWER_HALF] ; p_23__ l3
+
+;; do page overshoot checks
+
+ mov rax,NINP_BYTES_PER_ROUND
+
+
+ mov r9,rdx ; exact requested number of elements to process
+ add r9,rsi ; r9 last valid pointer +1 of requested input buffer
+
+ mov r10,rsi ; r10 saved start of input buffer
+ mov r12,r9 ; r12 save of end of input buffer+1
+
+ lea rcx,[rsi+rdx-1] ; rcx address of last byte requested to read
+
+ lea r8,[rdx+NINP_BYTES_PER_ROUND-1]
+ shr r8,NINP_BITSHIFT ; number of loops
+ shl r8,NINP_BITSHIFT
+ add r8,rsi ; r8 address of last byte+1 read in complete loops
+ add r8,NINP_BYTES_PER_ROUND-1 ; r8 address of last byte read in normal loop with overshoot
+
+ mov r11,r8
+
+; DISABLED for NO OVERSHOOT
+; add r11,rax ; r11 address of last byte of prefetched data
+
+ shr rcx,NSHIFT_ADDRESS_TO_PAGE ; rcx page number of last byte requested input
+ shr r8,NSHIFT_ADDRESS_TO_PAGE ; r8 page number of last byte read after normal round
+
+ cmp rcx,r8 ; stay on same page
+ je .LSAME_PAGE_IN_ROUND
+ sub rdx,rax ; don't overshoot in reading: do one round less
+
+.LSAME_PAGE_IN_ROUND:
+ shr r11,NSHIFT_ADDRESS_TO_PAGE ; r11 page number of byte after prefetched data
+ cmp rcx,r11
+ je .LSAME_PAGE_IN_PREFETCH
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+
+.LSAME_PAGE_IN_PREFETCH:
+ add rdx,rsi ; rdx last valid pointer+1 for normal loop
+
+ ; due to prefetch add one round to end checks
+ add rdx,rax
+ add r9,rax
+
+ mov r11,rdi ; r11 saved start of output buffer
+
+ mov rcx,NINP_BYTES_PER_ROUND<<1 ; increment of output buffer for each round
+
+;; start preprocessing before loop
+
+ MOVDQA xmm4,xmm8
+ PSRLQ xmm4,4 ; AAA RL00,RL01 QL0 shift Hx to lower nibble in byte
+ MOVDQA xmm5,xmm9
+ PSRLQ xmm5,4 ; AAA RL02,RL03 QL1 shift Hx to lower nibble in byte
+
+ MOVDQA xmm0,xmm4
+ PUNPCKLBW xmm0,xmm8 ; AAA RL00 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ MOVDQA xmm1,xmm4
+ PUNPCKHBW xmm1,xmm8 ; AAA RL01 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ add rsi,rax ; add the number of processed array elements
+
+
+ align 32
+
+; ;IACA START_MARKER
+; mov ebx, 111
+; db 0x64, 0x67, 0x90
+
+.LHEXENCODE_LOOP:
+
+;; process unpacked AAA in XMM0-XMM4 and XMM8-XMM11, UNPCK BBB to XMM0-XMM1, PreLoad AAA to XMM8-XMM11
+;;RL00,RL01,RL02,RL03,RL04,RL05,RL06,RL07
+ MOVDQA xmm2,xmm5
+ PUNPCKLBW xmm2,xmm9 ; AAA RL02 p____5 l1 QQ0 [Lin0_RghtH] [... R06_HL1 R04_HL1 R06_HL0 R04_HL0]
+ MOVDQA xmm4,xmm10
+ PSRLQ xmm4,4 ; AAA RL04,RL05 QL2 shift Hx to lower nibble in byte
+ MOVDQA xmm3,xmm5
+ PUNPCKHBW xmm3,xmm9 ; AAA RL03 p____5 l1 QQ0 [Lin0_LeftH] [... R07_HL1 R05_HL1 R07_HL0 R05_HL0]
+ PAND xmm0,xmm6 ; AAA RL00 mask lower nibble
+
+ MOVDQA xmm5,xmm11
+ PSRLQ xmm5,4 ; AAA RL06,RL07 QL3 shift Hx to lower nibble in byte
+ MOVDQA xmm8,xmm4
+ PUNPCKLBW xmm8,xmm10 ; AAA RL04 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ PAND xmm1,xmm6 ; AAA RL01 mask lower nibble
+ MOVDQA xmm9,xmm4
+ PUNPCKHBW xmm9,xmm10 ; AAA RL05 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ PAND xmm2,xmm6 ; AAA RL02 mask lower nibble
+ MOVDQA xmm10,xmm5
+ PUNPCKLBW xmm10,xmm11 ; AAA RL06 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ PAND xmm3,xmm6 ; AAA RL03 mask lower nibble
+ MOVDQA xmm4,xmm5
+ PUNPCKHBW xmm4,xmm11 ; AAA RL07 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ MOVDQA xmm11,xmm4
+
+ MOVDQA xmm4,xmm7
+ PSHUFB xmm4,xmm0 ; AAA RL00 shuffle_to_hex_digits
+ PAND xmm8,xmm6 ; AAA RL04 mask lower nibble
+ MOVDQA xmm5,xmm7
+ PSHUFB xmm5,xmm1 ; AAA RL01 shuffle_to_hex_digits
+ PAND xmm9,xmm6 ; AAA RL05 mask lower nibble
+
+ MOVDQA xmm0,xmm7
+ PSHUFB xmm0,xmm2 ; AAA RL02 shuffle_to_hex_digits
+ MOVDQU [rdi+0*16],xmm4 ; AAA RL00 Store Hexdump
+ PAND xmm10,xmm6 ; AAA RL06 mask lower nibble
+
+ MOVDQA xmm1,xmm7
+ PSHUFB xmm1,xmm3 ; AAA RL03 shuffle_to_hex_digits
+ MOVDQU [rdi+1*16],xmm5 ; AAA RL01 Store Hexdump
+ PAND xmm11,xmm6 ; AAA RL07 mask lower nibble
+
+ MOVDQA xmm2,xmm7
+ PSHUFB xmm2,xmm8 ; AAA RL04 shuffle_to_hex_digits
+ MOVDQA xmm4,xmm12
+ PSRLQ xmm4,4 ; BBB RL08,RL09 QL4 shift Hx to lower nibble in byte
+ MOVDQU [rdi+2*16],xmm0 ; AAA RL02 Store Hexdump
+ MOVDQA xmm3,xmm7
+ PSHUFB xmm3,xmm9 ; AAA RL05 shuffle_to_hex_digits
+ MOVDQA xmm5,xmm13
+ PSRLQ xmm5,4 ; BBB RL10,RL11 QL5 shift Hx to lower nibble in byte
+ MOVDQU [rdi+3*16],xmm1 ; AAA RL03 Store Hexdump
+
+ MOVDQA xmm0,xmm7
+ PSHUFB xmm0,xmm10 ; AAA RL06 shuffle_to_hex_digits
+ MOVDQU [rdi+4*16],xmm2 ; AAA RL04 Store Hexdump
+ MOVDQU xmm8, [rsi+0*16] ; AAA p_____5 p1____5 l3+ QL0
+ MOVDQU [rdi+5*16],xmm3 ; AAA RL05 Store Hexdump
+ MOVDQU xmm9, [rsi+1*16] ; AAA p_____5 p1____5 l3+ QL1
+
+ MOVDQA xmm1,xmm7
+ PSHUFB xmm1,xmm11 ; AAA RL07 shuffle_to_hex_digits
+ MOVDQU [rdi+6*16],xmm0 ; AAA RL06 Store Hexdump
+ MOVDQA xmm0,xmm4
+ PUNPCKLBW xmm0,xmm12 ; BBB RL08 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ MOVDQU xmm10,[rsi+2*16] ; AAA p_____5 p1____5 l3+ QL2
+
+ MOVDQU [rdi+7*16],xmm1 ; AAA RL07 Store Hexdump
+ MOVDQU xmm11,[rsi+3*16] ; AAA p_____5 p1____5 l3+ QL3
+ MOVDQA xmm1,xmm4
+	PUNPCKHBW	xmm1,xmm12		; BBB RL09 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+
+;; process unpacked BBB in XMM0-XMM4 and XMM9-XMM15, UNPCK AAA to XMM0-XMM1, PreLoad BBB to XMM12-XMM15
+;;RL08,RL09,RL10,RL11,RL12,RL13,RL14,RL15
+ MOVDQA xmm2,xmm5
+ PUNPCKLBW xmm2,xmm13 ; BBB RL10 p____5 l1 QQ0 [Lin0_RghtH] [... R06_HL1 R04_HL1 R06_HL0 R04_HL0]
+ MOVDQA xmm4,xmm14
+ PSRLQ xmm4,4 ; BBB RL12,RL13 QL6 shift Hx to lower nibble in byte
+ MOVDQA xmm3,xmm5
+ PUNPCKHBW xmm3,xmm13 ; BBB RL11 p____5 l1 QQ0 [Lin0_LeftH] [... R07_HL1 R05_HL1 R07_HL0 R05_HL0]
+ PAND xmm0,xmm6 ; BBB RL08 mask lower nibble
+
+ MOVDQA xmm5,xmm15
+ PSRLQ xmm5,4 ; BBB RL14,RL15 QL7 shift Hx to lower nibble in byte
+ MOVDQA xmm12,xmm4
+ PUNPCKLBW xmm12,xmm14 ; BBB RL12 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ PAND xmm1,xmm6 ; BBB RL09 mask lower nibble
+ MOVDQA xmm13,xmm4
+ PUNPCKHBW xmm13,xmm14 ; BBB RL13 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ PAND xmm2,xmm6 ; BBB RL10 mask lower nibble
+ MOVDQA xmm14,xmm5
+ PUNPCKLBW xmm14,xmm15 ; BBB RL14 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ PAND xmm3,xmm6 ; BBB RL11 mask lower nibble
+ MOVDQA xmm4,xmm5
+ PUNPCKHBW xmm4,xmm15 ; BBB RL15 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ MOVDQA xmm15,xmm4
+
+ MOVDQA xmm4,xmm7
+ PSHUFB xmm4,xmm0 ; BBB RL08 shuffle_to_hex_digits
+ PAND xmm12,xmm6 ; BBB RL12 mask lower nibble
+ MOVDQA xmm5,xmm7
+ PSHUFB xmm5,xmm1 ; BBB RL09 shuffle_to_hex_digits
+ PAND xmm13,xmm6 ; BBB RL13 mask lower nibble
+
+ MOVDQA xmm0,xmm7
+ PSHUFB xmm0,xmm2 ; BBB RL10 shuffle_to_hex_digits
+ MOVDQU [rdi+8*16],xmm4 ; BBB RL08 Store Hexdump
+ PAND xmm14,xmm6 ; BBB RL14 mask lower nibble
+
+ MOVDQA xmm1,xmm7
+ PSHUFB xmm1,xmm3 ; BBB RL11 shuffle_to_hex_digits
+ MOVDQU [rdi+9*16],xmm5 ; BBB RL09 Store Hexdump
+ PAND xmm15,xmm6 ; BBB RL15 mask lower nibble
+
+ MOVDQA xmm2,xmm7
+ PSHUFB xmm2,xmm12 ; BBB RL12 shuffle_to_hex_digits
+ MOVDQA xmm4,xmm8
+ PSRLQ xmm4,4 ; AAA RL00,RL01 QL0 shift Hx to lower nibble in byte
+ MOVDQU [rdi+10*16],xmm0 ; BBB RL10 Store Hexdump
+ MOVDQA xmm3,xmm7
+ PSHUFB xmm3,xmm13 ; BBB RL13 shuffle_to_hex_digits
+ MOVDQA xmm5,xmm9
+ PSRLQ xmm5,4 ; AAA RL02,RL03 QL1 shift Hx to lower nibble in byte
+ MOVDQU [rdi+11*16],xmm1 ; BBB RL11 Store Hexdump
+
+ MOVDQA xmm0,xmm7
+ PSHUFB xmm0,xmm14 ; BBB RL14 shuffle_to_hex_digits
+ MOVDQU [rdi+12*16],xmm2 ; BBB RL12 Store Hexdump
+	MOVDQU	xmm12, [rsi+4*16]	; BBB p_____5 p1____5 l3+ QL4
+ MOVDQU [rdi+13*16],xmm3 ; BBB RL13 Store Hexdump
+	MOVDQU	xmm13, [rsi+5*16]	; BBB p_____5 p1____5 l3+ QL5
+
+ MOVDQA xmm1,xmm7
+ PSHUFB xmm1,xmm15 ; BBB RL15 shuffle_to_hex_digits
+ MOVDQU [rdi+14*16],xmm0 ; BBB RL14 Store Hexdump
+ MOVDQA xmm0,xmm4
+ PUNPCKLBW xmm0,xmm8 ; AAA RL00 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+	MOVDQU	xmm14,[rsi+6*16]	; BBB p_____5 p1____5 l3+ QL6
+
+ MOVDQU [rdi+15*16],xmm1 ; BBB RL15 Store Hexdump
+	MOVDQU	xmm15,[rsi+7*16]	; BBB p_____5 p1____5 l3+ QL7
+
+ add rsi,rax ; add the number of processed array elements
+
+ MOVDQA xmm1,xmm4
+ PUNPCKHBW xmm1,xmm8 ; AAA RL01 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ add rdi,rcx ; add the number of processed output bytes
+
+
+ cmp rsi,rdx ; compare incremented rsi (= pointer to next chunk) to last valid pointer +1
+ jl .LHEXENCODE_LOOP
+
+ ; end of normal loop reached
+ ; we can do one more round when original count has been reduced by one round
+ cmp rax,0
+ je .LFINISH_EXTRA
+
+ cmp rdx,r9 ; input buffer length was not reduced when equal
+ je .LFINISH_NORMAL
+
+ sub rsi,rax ; for prefetching the last round, load the last round again
+	sub	rdx,rax			; adapt end condition for last round also
+ xor rax,rax
+ jmp .LHEXENCODE_LOOP
+
+
+.LFINISH_EXTRA:
+ add rsi,NINP_BYTES_PER_ROUND ; add the extra round to get processed bytes
+ jmp .LFINISH
+
+.LFINISH_NORMAL:
+ sub rsi,NINP_BYTES_PER_ROUND ; sub the added prefetch round to get processed bytes
+
+.LFINISH:
+ ; r9 = address of requested input bytes+1
+ ; rsi = address of processed input bytes+1
+ ; now get the minimum of rdx,rsi to rax
+;; sub rsi,rax ; for last round do nothing (rax=0), else sub increment for one round
+;; sub r9,rax
+
+ mov rax,r12
+ cmp rsi,r12 ; get min from r12 (address of requested input) and rsi (address of done input)
+
+ jge .LCALC_PROCESSED_BYTES
+ mov rax,rsi ; rax=address of last valid input byte+1
+
+.LCALC_PROCESSED_BYTES:
+ sub rax,r10 ; sub the input buffer start address
+ ; rax = number of valid processed input bytes = return value
+
+ cmp rsi,rdx ; compare rdx (address of requested input) and rsi (address of done input)
+ je .LNO_ZERO_OUT
+
+ mov r15,rax ; number of elements to process
+
+ shl r15,1 ; number of output bytes
+
+ add r15,r11 ; pointer to next byte after full valid output buffer
+
+
+ PXOR xmm0,xmm0 ; all zero
+;ZERO MOVDQU [r15],xmm0 ; zero out one register width after last output
+
+.LNO_ZERO_OUT:
+
+%ifdef __WIN__
+
+ MOVDQA xmm6 ,[rsp ]
+ MOVDQA xmm7 ,[rsp+1*16]
+ MOVDQA xmm8 ,[rsp+2*16]
+ MOVDQA xmm9 ,[rsp+3*16]
+ MOVDQA xmm10,[rsp+4*16]
+ MOVDQA xmm11,[rsp+5*16]
+ MOVDQA xmm12,[rsp+6*16]
+ MOVDQA xmm13,[rsp+7*16]
+ MOVDQA xmm14,[rsp+8*16]
+ MOVDQA xmm15,[rsp+9*16]
+
+%endif
+
+ mov rdi,[rsp+STACK_FOR_XMM+0*8]
+ mov rsi,[rsp+STACK_FOR_XMM+1*8]
+ mov r12,[rsp+STACK_FOR_XMM+2*8]
+ mov r14,[rsp+STACK_FOR_XMM+3*8]
+ mov r15,[rsp+STACK_FOR_XMM+4*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+
+
+;----------------------------------------------------------------------------------------------
+
+; k7 ; CONST BITMASK_ONE_ZERO 1010101010101010 selecting upper half
+; k6 ;
+; k5 ; CONST BITMASK_ZERO_ONE 0101010101010101 selecting lower half
+; k4 ; digit flags QL3
+; k3 ; digit flags QL2
+; k2 ; digit flags QL1
+; k1 ; digit flags QL0
+; k0 ;
+
+; zmm31 ; CONST ENCODE_SHUFFLE_TO_HEX
+; zmm30 ; CONST BITMASK_NIBBLE_3_IN_WORD
+; zmm29 ; CONST VPERM_ENCODE_OFFSETS
+; zmm28 ; CONST ALL_BYTES_39 ; CONST 48 = 39+9, calculated in the middle
+; zmm27 ; Unpack Upper RL5 RL7
+; zmm26 ; Unpack Lower RL4 RL6
+; zmm25 ; Unpack Upper RL1 RL3
+; zmm24 ; Unpack Lower RL0 RL2
+; zmm23 ; Source Load QLF
+; zmm22 ; Source Load QLE
+; zmm21 ; Source Load QLD
+; zmm20 ; Source Load QLC
+; zmm19 ; Source Load QLB
+; zmm18 ; Source Load QLA
+; zmm17 ; Source Load QL9
+; zmm16 ; Source Load QL8
+
+; zmm15 ; Source Load QL7
+; zmm14 ; Source Load QL6
+; zmm13 ; Source Load QL5
+; zmm12 ; Source Load QL4
+; zmm11 ; Source Load QL3
+; zmm10 ; Source Load QL2
+; zmm9 ; Source Load QL1
+; zmm8 ; Source Load QL0
+; zmm7 ; RL3
+; zmm6 ; RL3
+; zmm5 ; RL2
+; zmm4 ; RL2
+; zmm3 ;
+; zmm2 ; RL1
+; zmm1 ; CONST ALL bytes 48
+; zmm0 ; RL0
+
+%define NHALF_INP_BYTES_PER_ROUND 8*64
+%define NINP_BYTES_PER_ROUND 2*NHALF_INP_BYTES_PER_ROUND
+%define NINP_BITSHIFT 10
+
+hex_encode_avx512bw:
+
+ sub rsp,STACK_ADJ
+
+ mov [rsp+STACK_FOR_XMM+0*8],rdi
+ mov [rsp+STACK_FOR_XMM+1*8],rsi
+ mov [rsp+STACK_FOR_XMM+2*8],r12
+ mov [rsp+STACK_FOR_XMM+3*8],r14
+ mov [rsp+STACK_FOR_XMM+4*8],r15
+
+%ifdef __WIN__
+
+ VMOVDQA [rsp ],xmm6
+ VMOVDQA [rsp+1*16],xmm7
+ VMOVDQA [rsp+2*16],xmm8
+ VMOVDQA [rsp+3*16],xmm9
+ VMOVDQA [rsp+4*16],xmm10
+ VMOVDQA [rsp+5*16],xmm11
+ VMOVDQA [rsp+6*16],xmm12
+ VMOVDQA [rsp+7*16],xmm13
+ VMOVDQA [rsp+8*16],xmm14
+ VMOVDQA [rsp+9*16],xmm15
+
+ mov rdi,rcx ; parameter 1 output buffer
+
+ mov rsi,rdx ; parameter 2 input buffer
+
+ mov rdx,r8 ; parameter 3 number of elements
+
+%endif
+
+ VMOVDQA64 zmm29,[VPERM_ENCODE_OFFSETS] ; p_23__ l3
+
+;; initializer for QQ0 and QQ1
+
+ mov rax,NHALF_INP_BYTES_PER_ROUND ; half-round increment; must be set before the first "add rsi,rax" below
+
+ VPERMQ zmm8 ,zmm29,[rsi+0*64] ; AAA p____5 l3+ QL0 RL00,RL01
+ VPERMQ zmm9 ,zmm29,[rsi+1*64] ; AAA p____5 l3+ QL1 RL02,RL03
+ VPERMQ zmm10,zmm29,[rsi+2*64] ; AAA p____5 l3+ QL2 RL04,RL05
+ VPERMQ zmm11,zmm29,[rsi+3*64] ; AAA p____5 l3+ QL3 RL06,RL07
+
+ VPERMQ zmm12,zmm29,[rsi+4*64] ; BBB p____5 l3+ QL4 RL08,RL09
+ VPERMQ zmm13,zmm29,[rsi+5*64] ; BBB p____5 l3+ QL5 RL10,RL11
+ VPERMQ zmm14,zmm29,[rsi+6*64] ; BBB p____5 l3+ QL6 RL12,RL13
+ VPERMQ zmm15,zmm29,[rsi+7*64] ; BBB p____5 l3+ QL7 RL14,RL15
+
+ add rsi,rax ; add half the number of processed array elements
+
+ VPERMQ zmm16,zmm29,[rsi+0*64] ; CCC p____5 l3+ QL8 RL16,RL17
+ VPERMQ zmm17,zmm29,[rsi+1*64] ; CCC p____5 l3+ QL9 RL18,RL19
+ VPERMQ zmm18,zmm29,[rsi+2*64] ; CCC p____5 l3+ QLA RL20,RL21
+ VPERMQ zmm19,zmm29,[rsi+3*64] ; CCC p____5 l3+ QLB RL22,RL23
+
+ VPERMQ zmm20,zmm29,[rsi+4*64] ; DDD p____5 l3+ QLC RL24,RL25
+ VPERMQ zmm21,zmm29,[rsi+5*64] ; DDD p____5 l3+ QLD RL26,RL27
+ VPERMQ zmm22,zmm29,[rsi+6*64] ; DDD p____5 l3+ QLE RL28,RL29
+ VPERMQ zmm23,zmm29,[rsi+7*64] ; DDD p____5 l3+ QLF RL30,RL31
+
+;; initialize constants
+
+ KMOVQ k7,[BITMASK_ONE_ZERO]
+
+ VMOVDQA64 zmm31,[ENCODE_SHUFFLE_TO_HEX] ; p_23__ l3
+ VMOVDQA64 zmm1,zmm31
+
+ VPBROADCASTQ zmm30,[BITMASK_LOWER_HALF] ; p_23__ l3
+
+
+ VMOVDQA64 zmm28,[ENCODE_SHUFFLE_TO_HIGH_LOW] ; p_23__ l3
+
+;; do page overshoot checks
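+; (Each iteration reads a full NINP_BYTES_PER_ROUND block plus the prefetch for the next
+; half round, so the final iteration may read past the requested input.  The checks below
+; compare the memory page of the last requested byte with the page of the last byte such an
+; overshooting read would touch; if they differ, the loop bound is reduced by one round so
+; that no load can fault on an unmapped page, and the unprocessed tail is simply reflected
+; in the return value.)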
+
+ mov rax,NHALF_INP_BYTES_PER_ROUND
+
+
+ mov r9,rdx ; exact requested number of elements to process
+ add r9,rsi ; r9 last valid pointer +1 of requested input buffer
+
+ mov r10,rsi ; r10 saved start of input buffer
+ mov r12,r9 ; r12 save of end of input buffer+1
+
+ lea rcx,[rsi+rdx-1] ; rcx address of last byte requested to read
+
+ lea r8,[rdx+NINP_BYTES_PER_ROUND-1]
+ shr r8,NINP_BITSHIFT ; number of loops
+ shl r8,NINP_BITSHIFT
+ add r8,rsi ; r8 address of last byte+1 read in complete loops
+ add r8,NINP_BYTES_PER_ROUND-1 ; r8 address of last byte read in normal loop with overshoot
+
+ mov r11,r8
+
+; DISABLED for NO OVERSHOOT
+; add r11,rax ; r11 address of last byte of prefetched data (ONLY HALF A ROUND!)
+ sub r11,rax ; r11 address of last byte of prefetched data (ONLY HALF A ROUND!)
+ sub r11,rax ; r11 address of last byte of prefetched data (ONLY HALF A ROUND!)
+ sub r11,rax ; r11 address of last byte of prefetched data (ONLY HALF A ROUND!)
+
+ shr rcx,NSHIFT_ADDRESS_TO_PAGE ; rcx page number of last byte after normal round
+ shr r8,NSHIFT_ADDRESS_TO_PAGE ; r8 page number of last byte after prefetch
+
+ cmp rcx,r8 ; stay on same page
+ je .LSAME_PAGE_IN_ROUND
+ sub rdx,rax ; don't overshoot in reading: do one round less
+ sub rdx,rax ; don't overshoot in reading: do one round less
+
+.LSAME_PAGE_IN_ROUND:
+ shr r11,NSHIFT_ADDRESS_TO_PAGE ; r11 page number of byte after prefetched data
+ cmp rcx,r11
+ je .LSAME_PAGE_IN_PREFETCH
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+ sub rdx,rax ; don't overshoot in prefetch reading: do one round less
+
+.LSAME_PAGE_IN_PREFETCH:
+ add rdx,rsi ; rdx last valid pointer+1 for normal loop
+
+ ; due to prefetch add one round to end checks
+ add rdx,rax
+ add r9,rax
+
+ mov r11,rdi ; r11 saved start of output buffer
+
+ mov rcx,NHALF_INP_BYTES_PER_ROUND<<1 ; increment of output buffer for each round
+
+
+;; start preprocessing before loop
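+; (The first shift/unpack steps for the AAA block are peeled out of the loop here, so that
+; inside the loop the remaining work on one block can be interleaved with the loads and
+; stores of the neighbouring blocks -- a simple form of software pipelining.)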
+
+ VPSRLQ zmm2,zmm8,4 ; AAA RL00+RL01 QL0 shift Hx to lower nibble in byte
+ VPSRLQ zmm3,zmm9,4 ; AAA RL02+RL03 QL1 shift Hx to lower nibble in byte
+
+ VPUNPCKLBW zmm0,zmm2,zmm8 ; AAA RL00 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPUNPCKHBW zmm1,zmm2,zmm8 ; AAA RL01 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPSRLQ zmm6,zmm10,4 ; AAA RL04+RL05 QL2 shift Hx to lower nibble in byte
+ VPSRLQ zmm7,zmm11,4 ; AAA RL06+RL07 QL3 shift Hx to lower nibble in byte
+
+ VPUNPCKLBW zmm2,zmm3,zmm9 ; AAA RL02 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPUNPCKHBW zmm3,zmm3,zmm9 ; AAA RL03 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ add rsi,rax ; add half the number of processed array elements
+
+ align 32
+
+; ;IACA START_MARKER
+; mov ebx, 111
+; db 0x64, 0x67, 0x90
+
+; Vector Port info AVX512
+; ----------------------------------------
+; VPShift p0 l1
+; VPMax/Min p0 l1
+; VPMUL p0 l5 ; with 2FMA-Units p05 (SKX,CLX etc.)
+; VPMOVB2M p0 l3
+; VPSUBUSB /SSB p0 l1
+
+; VPALIGNR p5 l1 ;Shift of n*8 bits!
+; VPERM p5 l3
+; VPERMI2x 1*p05+2*p5 l7 ; (l9 with flags)
+; VPCompare p5 l3-l4
+; VP Pack/Unpack p5 l1(SKX) l3(TGL)
+; VPSHUF p5 l1
+
+
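+; Per 64-byte input block the loop performs the usual vectorized hex encode:
+;   1. VPSRLQ xx,4        - move the high nibble of every byte down to the low nibble
+;   2. VPUNPCKL/HBW       - interleave shifted and original bytes so the high and low
+;                           digits of each source byte become adjacent output bytes
+;   3. VPANDQ with zmm30  - keep only the low 4 bits of every byte
+;   4. VPSHUFB with zmm31 - table lookup nibble -> ASCII hex digit (ENCODE_SHUFFLE_TO_HEX)
+;   5. VMOVDQA64          - store 128 hex digits per 64 input bytes
+; One iteration consumes NINP_BYTES_PER_ROUND (1024) input bytes in four blocks (AAA..DDD)
+; and emits 2048 digits; the VPERMQ loads (through zmm29) for the next half round are
+; interleaved with the stores to hide their latency.
+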
+.LHEXENCODE_LOOP:
+
+;; AAA+BBB
+; process unpacked AAA (QL0-QL3=RL00-RL07) in zmm0-zmm7 and process BBB (QL4-QL7=RL08-RL15) in zmm8-zmm15 and zmm2+zmm3
+
+ VPUNPCKLBW zmm4,zmm6,zmm10 ; AAA RL04 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm0,zmm0,zmm30 ; AAA RL00 mask lower nibble
+
+ VPUNPCKHBW zmm5,zmm6,zmm10 ; AAA RL05 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPSRLQ zmm10,zmm12,4 ; BBB RL08+RL09 QL4 shift Hx to lower nibble in byte
+ VPUNPCKLBW zmm8,zmm10,zmm12 ; BBB RL08 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm1,zmm1,zmm30 ; AAA RL01 mask lower nibble
+
+ VPUNPCKLBW zmm6,zmm7,zmm11 ; AAA RL06 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm2,zmm2,zmm30 ; AAA RL02 mask lower nibble
+ VPUNPCKHBW zmm7,zmm7,zmm11 ; AAA RL07 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPSRLQ zmm11,zmm13,4 ; BBB RL10+RL11 QL5 shift Hx to lower nibble in byte
+ VPANDQ zmm8,zmm8,zmm30 ; BBB RL08 mask lower nibble
+
+ VPUNPCKHBW zmm9,zmm10,zmm12 ; BBB RL09 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPANDQ zmm3,zmm3,zmm30 ; AAA RL03 mask lower nibble
+ VPUNPCKLBW zmm10,zmm11,zmm13 ; BBB RL10 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm9,zmm9,zmm30 ; BBB RL09 mask lower nibble
+
+ VPSHUFB zmm0,zmm31,zmm0 ; AAA RL00 shuffle_to_hex_digits
+ VPANDQ zmm4,zmm4,zmm30 ; AAA RL04 mask lower nibble
+ VPUNPCKHBW zmm11,zmm11,zmm13 ; BBB RL11 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPANDQ zmm10,zmm10,zmm30 ; BBB RL10 mask lower nibble
+
+ VPSHUFB zmm1,zmm31,zmm1 ; AAA RL01 shuffle_to_hex_digits
+ VPANDQ zmm5,zmm5,zmm30 ; AAA RL05 mask lower nibble
+ VPSHUFB zmm8,zmm31,zmm8 ; BBB RL08 shuffle_to_hex_digits
+ VPANDQ zmm11,zmm11,zmm30 ; BBB RL11 mask lower nibble
+
+ VPSHUFB zmm2,zmm31,zmm2 ; AAA RL02 shuffle_to_hex_digits
+ VMOVDQA64 [rdi+0*64],zmm0 ; AAA RL00 Store Hexdump
+ VMOVDQA64 [rdi+1*64],zmm1 ; AAA RL01 Store Hexdump
+ VPANDQ zmm6,zmm6,zmm30 ; AAA RL06 mask lower nibble
+ VPSHUFB zmm9,zmm31,zmm9 ; BBB RL09 shuffle_to_hex_digits
+
+ VPSHUFB zmm3,zmm31,zmm3 ; AAA RL03 shuffle_to_hex_digits
+ VPANDQ zmm7,zmm7,zmm30 ; AAA RL07 mask lower nibble
+ VMOVDQA64 [rdi+2*64],zmm2 ; AAA RL02 Store Hexdump
+ VPSRLQ zmm2,zmm14,4 ; BBB RL12+RL13 QL6 shift Hx to lower nibble in byte
+ VPSHUFB zmm10,zmm31,zmm10 ; BBB RL10 shuffle_to_hex_digits
+ VMOVDQA64 [rdi+3*64],zmm3 ; AAA RL03 Store Hexdump
+
+ VPSRLQ zmm3,zmm15,4 ; BBB RL14+RL15 QL7 shift Hx to lower nibble in byte
+ VPUNPCKLBW zmm12,zmm2,zmm14 ; BBB RL12 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPSHUFB zmm4,zmm31,zmm4 ; AAA RL04 shuffle_to_hex_digits
+ VMOVDQA64 [rdi+4*64],zmm4 ; AAA RL04 Store Hexdump
+ VPSHUFB zmm11,zmm31,zmm11 ; BBB RL11 shuffle_to_hex_digits
+ VPUNPCKHBW zmm13,zmm2,zmm14 ; BBB RL13 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPSHUFB zmm5,zmm31,zmm5 ; AAA RL05 shuffle_to_hex_digits
+ VPANDQ zmm12,zmm12,zmm30 ; BBB RL12 mask lower nibble
+ VPUNPCKLBW zmm14,zmm3,zmm15 ; BBB RL14 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VMOVDQA64 [rdi+5*64],zmm5 ; AAA RL05 Store Hexdump
+
+ VPSHUFB zmm6,zmm31,zmm6 ; AAA RL06 shuffle_to_hex_digits
+ VPANDQ zmm13,zmm13,zmm30 ; BBB RL13 mask lower nibble
+ VPUNPCKHBW zmm15,zmm3,zmm15 ; BBB RL15 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPSHUFB zmm7,zmm31,zmm7 ; AAA RL07 shuffle_to_hex_digits
+ VPANDQ zmm14,zmm14,zmm30 ; BBB RL14 mask lower nibble
+ VMOVDQA64 [rdi+6*64],zmm6 ; AAA RL06 Store Hexdump
+ VMOVDQA64 [rdi+7*64],zmm7 ; AAA RL07 Store Hexdump
+ VPSHUFB zmm12,zmm31,zmm12 ; BBB RL12 shuffle_to_hex_digits
+ VPANDQ zmm15,zmm15,zmm30 ; BBB RL15 mask lower nibble
+;
+ VMOVDQA64 [rdi+8*64],zmm8 ; BBB RL08 Store Hexdump
+ VPERMQ zmm8 ,zmm29,[rsi+0*64] ; AAA p____5 l3+ QL0 RL00,RL01
+ VMOVDQA64 [rdi+9*64],zmm9 ; BBB RL09 Store Hexdump
+ VPERMQ zmm9 ,zmm29,[rsi+1*64] ; AAA p____5 l3+ QL1 RL02,RL03
+ VPSHUFB zmm13,zmm31,zmm13 ; BBB RL13 shuffle_to_hex_digits
+
+ VMOVDQA64 [rdi+10*64],zmm10 ; BBB RL10 Store Hexdump
+ VPERMQ zmm10,zmm29,[rsi+2*64] ; AAA p____5 l3+ QL2 RL04,RL05
+ VPSRLQ zmm2,zmm16,4 ; CCC RL16+RL17 QL8 shift Hx to lower nibble in byte
+ VMOVDQA64 [rdi+11*64],zmm11 ; BBB RL11 Store Hexdump
+ VPERMQ zmm11,zmm29,[rsi+3*64] ; AAA p____5 l3+ QL3 RL06,RL07
+
+ VPSHUFB zmm14,zmm31,zmm14 ; BBB RL14 shuffle_to_hex_digits
+ VPSRLQ zmm3,zmm17,4 ; CCC RL18+RL19 QL9 shift Hx to lower nibble in byte
+ VPSHUFB zmm15,zmm31,zmm15 ; BBB RL15 shuffle_to_hex_digits
+
+ VPUNPCKLBW zmm0,zmm2,zmm16 ; CCC RL16 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPSRLQ zmm6,zmm18,4 ; CCC RL20+RL21 QLA shift Hx to lower nibble in byte
+ VMOVDQA64 [rdi+12*64],zmm12 ; BBB RL12 Store Hexdump
+ VPERMQ zmm12,zmm29,[rsi+4*64] ; BBB p____5 l3+ QL4 RL08,RL09
+
+ VPUNPCKHBW zmm1,zmm2,zmm16 ; CCC RL17 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPSRLQ zmm7,zmm19,4 ; CCC RL22+RL23 QLB shift Hx to lower nibble in byte
+ VMOVDQA64 [rdi+13*64],zmm13 ; BBB RL13 Store Hexdump
+ VPERMQ zmm13,zmm29,[rsi+5*64] ; BBB p____5 l3+ QL5 RL10,RL11
+
+ VMOVDQA64 [rdi+14*64],zmm14 ; BBB RL14 Store Hexdump
+ VPERMQ zmm14,zmm29,[rsi+6*64] ; BBB p____5 l3+ QL6 RL12,RL13
+ VPUNPCKLBW zmm2,zmm3,zmm17 ; CCC RL18 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VMOVDQA64 [rdi+15*64],zmm15 ; BBB RL15 Store Hexdump
+
+ add rdi,rcx ; add half the number of processed output bytes
+
+ VPERMQ zmm15,zmm29,[rsi+7*64] ; BBB p____5 l3+ QL7 RL14,RL15
+ VPUNPCKHBW zmm3,zmm3,zmm17 ; CCC RL19 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+
+;; CCC+DDD
+; process unpacked CCC (QL8-QLB=RL16-RL23) in zmm0-zmm7 and process DDD (QLC-QLF=RL24-RL31) in zmm16-zmm23 and zmm2+zmm3
+ add rsi,rax ; add half the number of processed array elements
+
+ VPUNPCKLBW zmm4,zmm6,zmm18 ; CCC RL20 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm0,zmm0,zmm30 ; CCC RL16 mask lower nibble
+
+ VPUNPCKHBW zmm5,zmm6,zmm18 ; CCC RL21 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPSRLQ zmm18,zmm20,4 ; DDD RL24+RL25 QLC shift Hx to lower nibble in byte
+ VPUNPCKLBW zmm16,zmm18,zmm20 ; DDD RL24 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm1,zmm1,zmm30 ; CCC RL17 mask lower nibble
+
+ VPUNPCKLBW zmm6,zmm7,zmm19 ; CCC RL22 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm2,zmm2,zmm30 ; CCC RL18 mask lower nibble
+ VPUNPCKHBW zmm7,zmm7,zmm19 ; CCC RL23 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPSRLQ zmm19,zmm21,4 ; DDD RL26+RL27 QLD shift Hx to lower nibble in byte
+ VPANDQ zmm16,zmm16,zmm30 ; DDD RL24 mask lower nibble
+
+ VPUNPCKHBW zmm17,zmm18,zmm20 ; DDD RL25 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPANDQ zmm3,zmm3,zmm30 ; CCC RL19 mask lower nibble
+ VPUNPCKLBW zmm18,zmm19,zmm21 ; DDD RL26 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPANDQ zmm17,zmm17,zmm30 ; DDD RL25 mask lower nibble
+
+ VPSHUFB zmm0,zmm31,zmm0 ; CCC RL16 shuffle_to_hex_digits
+ VPANDQ zmm4,zmm4,zmm30 ; CCC RL20 mask lower nibble
+ VPUNPCKHBW zmm19,zmm19,zmm21 ; DDD RL27 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPANDQ zmm18,zmm18,zmm30 ; DDD RL26 mask lower nibble
+
+ VPSHUFB zmm1,zmm31,zmm1 ; CCC RL17 shuffle_to_hex_digits
+ VPANDQ zmm5,zmm5,zmm30 ; CCC RL21 mask lower nibble
+ VPSHUFB zmm16,zmm31,zmm16 ; DDD RL24 shuffle_to_hex_digits
+ VPANDQ zmm19,zmm19,zmm30 ; DDD RL27 mask lower nibble
+
+ VPSHUFB zmm2,zmm31,zmm2 ; CCC RL18 shuffle_to_hex_digits
+ VMOVDQA64 [rdi+0*64],zmm0 ; CCC RL16 Store Hexdump
+ VMOVDQA64 [rdi+1*64],zmm1 ; CCC RL17 Store Hexdump
+ VPANDQ zmm6,zmm6,zmm30 ; CCC RL22 mask lower nibble
+ VPSHUFB zmm17,zmm31,zmm17 ; DDD RL25 shuffle_to_hex_digits
+
+ VPSHUFB zmm3,zmm31,zmm3 ; CCC RL19 shuffle_to_hex_digits
+ VPANDQ zmm7,zmm7,zmm30 ; CCC RL23 mask lower nibble
+ VMOVDQA64 [rdi+2*64],zmm2 ; CCC RL18 Store Hexdump
+ VPSRLQ zmm2,zmm22,4 ; DDD RL28+RL29 QLE shift Hx to lower nibble in byte
+ VPSHUFB zmm18,zmm31,zmm18 ; DDD RL26 shuffle_to_hex_digits
+ VMOVDQA64 [rdi+3*64],zmm3 ; CCC RL19 Store Hexdump
+
+ VPSRLQ zmm3,zmm23,4 ; DDD RL30+RL31 QLF shift Hx to lower nibble in byte
+ VPUNPCKLBW zmm20,zmm2,zmm22 ; DDD RL28 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPSHUFB zmm4,zmm31,zmm4 ; CCC RL20 shuffle_to_hex_digits
+ VMOVDQA64 [rdi+4*64],zmm4 ; CCC RL20 Store Hexdump
+ VPSHUFB zmm19,zmm31,zmm19 ; DDD RL27 shuffle_to_hex_digits
+ VPUNPCKHBW zmm21,zmm2,zmm22 ; DDD RL29 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPSHUFB zmm5,zmm31,zmm5 ; CCC RL21 shuffle_to_hex_digits
+ VPANDQ zmm20,zmm20,zmm30 ; DDD RL28 mask lower nibble
+ VPUNPCKLBW zmm22,zmm3,zmm23 ; DDD RL30 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VMOVDQA64 [rdi+5*64],zmm5 ; CCC RL21 Store Hexdump
+
+ VPSHUFB zmm6,zmm31,zmm6 ; CCC RL22 shuffle_to_hex_digits
+ VPANDQ zmm21,zmm21,zmm30 ; DDD RL29 mask lower nibble
+ VPUNPCKHBW zmm23,zmm3,zmm23 ; DDD RL31 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ VPSHUFB zmm7,zmm31,zmm7 ; CCC RL23 shuffle_to_hex_digits
+ VPANDQ zmm22,zmm22,zmm30 ; DDD RL30 mask lower nibble
+ VMOVDQA64 [rdi+6*64],zmm6 ; CCC RL22 Store Hexdump
+ VMOVDQA64 [rdi+7*64],zmm7 ; CCC RL23 Store Hexdump
+ VPSHUFB zmm20,zmm31,zmm20 ; DDD RL28 shuffle_to_hex_digits
+ VPANDQ zmm23,zmm23,zmm30 ; DDD RL31 mask lower nibble
+;
+ VMOVDQA64 [rdi+8*64],zmm16 ; DDD RL24 Store Hexdump
+ VPERMQ zmm16,zmm29,[rsi+0*64] ; CCC p____5 l3+ QL8 RL16,RL17
+ VMOVDQA64 [rdi+9*64],zmm17 ; DDD RL25 Store Hexdump
+ VPERMQ zmm17,zmm29,[rsi+1*64] ; CCC p____5 l3+ QL9 RL18,RL19
+ VPSHUFB zmm21,zmm31,zmm21 ; DDD RL29 shuffle_to_hex_digits
+
+ VMOVDQA64 [rdi+10*64],zmm18 ; DDD RL26 Store Hexdump
+ VPERMQ zmm18,zmm29,[rsi+2*64] ; CCC p____5 l3+ QLA RL20,RL21
+ VPSRLQ zmm2,zmm8,4 ; AAA RL00+RL01 QL0 shift Hx to lower nibble in byte
+ VMOVDQA64 [rdi+11*64],zmm19 ; DDD RL27 Store Hexdump
+ VPERMQ zmm19,zmm29,[rsi+3*64] ; CCC p____5 l3+ QLB RL22,RL23
+
+ VPSHUFB zmm22,zmm31,zmm22 ; DDD RL30 shuffle_to_hex_digits
+ VPSRLQ zmm3,zmm9,4 ; AAA RL02+RL03 QL1 shift Hx to lower nibble in byte
+ VPSHUFB zmm23,zmm31,zmm23 ; DDD RL31 shuffle_to_hex_digits
+
+ VPUNPCKLBW zmm0,zmm2,zmm8 ; AAA RL00 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VPSRLQ zmm6,zmm10,4 ; AAA RL04+RL05 QL2 shift Hx to lower nibble in byte
+ VMOVDQA64 [rdi+12*64],zmm20 ; DDD RL28 Store Hexdump
+ VPERMQ zmm20,zmm29,[rsi+4*64] ; DDD p____5 l3+ QLC RL24,RL25
+
+ VPUNPCKHBW zmm1,zmm2,zmm8 ; AAA RL01 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+ VPSRLQ zmm7,zmm11,4 ; AAA RL06+RL07 QL3 shift Hx to lower nibble in byte
+ VMOVDQA64 [rdi+13*64],zmm21 ; DDD RL29 Store Hexdump
+ VPERMQ zmm21,zmm29,[rsi+5*64] ; DDD p____5 l3+ QLD RL26,RL27
+
+ VMOVDQA64 [rdi+14*64],zmm22 ; DDD RL30 Store Hexdump
+ VPERMQ zmm22,zmm29,[rsi+6*64] ; DDD p____5 l3+ QLE RL28,RL29
+ VPUNPCKLBW zmm2,zmm3,zmm9 ; AAA RL02 p____5 l1 QQ0 [Lin0_RghtH] [... R02_HL1 R00_HL1 R02_HL0 R00_HL0]
+ VMOVDQA64 [rdi+15*64],zmm23 ; DDD RL31 Store Hexdump
+ VPERMQ zmm23,zmm29,[rsi+7*64] ; DDD p____5 l3+ QLF RL30,RL31
+
+ add rsi,rax ; add half the number of processed array elements
+
+ VPUNPCKHBW zmm3,zmm3,zmm9 ; AAA RL03 p____5 l1 QQ0 [Lin0_LeftH] [... R03_HL1 R01_HL1 R03_HL0 R01_HL0]
+
+ add rdi,rcx ; add half the number of processed output bytes
+
+ cmp rsi,rdx ; compare incremented rsi (= pointer to next chunk) to last valid pointer +1
+ jl .LHEXENCODE_LOOP
+
+ ; end of normal loop reached
+ ; we can do one more round when original count has been reduced by one round
+ cmp rax,0
+ je .LFINISH_EXTRA
+
+ cmp rdx,r9 ; input buffer length was not reduced when equal
+ je .LFINISH_NORMAL
+
+ add rax,rax ; rax is only half the bytes of input round, so double it
+ sub rsi,rax ; for prefetching the last round, load the last round again
+ sub rdx,rax ; adapt the end condition for the last round as well
+ mov rax,0
+ jmp .LHEXENCODE_LOOP
+
+
+.LFINISH_EXTRA:
+ add rsi,NINP_BYTES_PER_ROUND ; add the extra round to get processed bytes
+ jmp .LFINISH
+
+.LFINISH_NORMAL:
+ sub rsi,NINP_BYTES_PER_ROUND ; sub the added prefetch round to get processed bytes
+
+.LFINISH:
+ ; r9 = address of requested input bytes+1
+ ; rsi = address of processed input bytes+1
+ ; now get the minimum of r12 and rsi into rax
+;; sub rsi,rax ; for last round do nothing (rax=0), else sub increment for one round
+;; sub r9,rax
+
+ mov rax,r12
+ cmp rsi,r12 ; get min from r12 (saved end of requested input) and rsi (end of processed input)
+
+ jge .LCALC_PROCESSED_BYTES
+ mov rax,rsi ; rax=address of last valid input byte+1
+
+.LCALC_PROCESSED_BYTES:
+ sub rax,r10 ; sub the input buffer start address
+ ; rax = number of valid processed input bytes = return value
+
+ cmp rsi,rdx ; compare rdx (address of requested input) and rsi (address of done input)
+ je .LNO_ZERO_OUT
+
+ mov r15,rax ; number of elements to process
+
+ shl r15,1 ; number of output bytes
+
+ add r15,r11 ; pointer to next byte after full valid output buffer
+
+
+ VPXORQ zmm0,zmm0,zmm0 ; all zero
+;ZERO VMOVDQU64 [r15],zmm0 ; zero out one register width after last output
+
+.LNO_ZERO_OUT:
+
+%ifdef __WIN__
+
+ VMOVDQA xmm6 ,[rsp ]
+ VMOVDQA xmm7 ,[rsp+1*16]
+ VMOVDQA xmm8 ,[rsp+2*16]
+ VMOVDQA xmm9 ,[rsp+3*16]
+ VMOVDQA xmm10,[rsp+4*16]
+ VMOVDQA xmm11,[rsp+5*16]
+ VMOVDQA xmm12,[rsp+6*16]
+ VMOVDQA xmm13,[rsp+7*16]
+ VMOVDQA xmm14,[rsp+8*16]
+ VMOVDQA xmm15,[rsp+9*16]
+
+%endif
+
+ mov rdi,[rsp+STACK_FOR_XMM+0*8]
+ mov rsi,[rsp+STACK_FOR_XMM+1*8]
+ mov r12,[rsp+STACK_FOR_XMM+2*8]
+ mov r14,[rsp+STACK_FOR_XMM+3*8]
+ mov r15,[rsp+STACK_FOR_XMM+4*8]
+
+ add rsp,STACK_ADJ
+
+ ret
+
+;----------------------------------------------------------------------------------------------
+
+%endif
diff --git a/postgresql-15devel_orig/src/backend/utils/adt/varlena.c b/postgresql-15devel/src/backend/utils/adt/varlena.c
index bd3091b..183f67f 100644
--- a/postgresql-15devel_orig/src/backend/utils/adt/varlena.c
+++ b/postgresql-15devel/src/backend/utils/adt/varlena.c
@@ -397,7 +397,7 @@ byteaout(PG_FUNCTION_ARGS)
if (bytea_output == BYTEA_OUTPUT_HEX)
{
/* Print hex format */
- rp = result = palloc(VARSIZE_ANY_EXHDR(vlena) * 2 + 2 + 1);
+ rp = result = palloc(hex_enc_len(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena)) + 2 + 1);
*rp++ = '\\';
*rp++ = 'x';
rp += hex_encode(VARDATA_ANY(vlena), VARSIZE_ANY_EXHDR(vlena), rp);
diff --git a/postgresql-15devel_orig/src/include/utils/builtins.h b/postgresql-15devel/src/include/utils/builtins.h
index b07eefa..e6efb73 100644
--- a/postgresql-15devel_orig/src/include/utils/builtins.h
+++ b/postgresql-15devel/src/include/utils/builtins.h
@@ -35,6 +35,9 @@ extern int errdomainconstraint(Oid datatypeOid, const char *conname);
extern uint64 hex_encode(const char *src, size_t len, char *dst);
extern uint64 hex_decode(const char *src, size_t len, char *dst);
+extern uint64 hex_enc_len(const char *src, size_t srclen);
+extern uint64 hex_dec_len(const char *src, size_t srclen);
+
/* int.c */
extern int2vector *buildint2vector(const int16 *int2s, int n);
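(Note: the two new extern declarations presumably just export the length helpers that
encode.c already uses internally; assuming the conventional definitions -- two hex digits
per input byte when encoding, one byte per two digits when decoding -- they amount to no
more than the following sketch, where uint64 is PostgreSQL's typedef from c.h and src is
unused but kept so all encode/decode length functions share one signature:)

    uint64
    hex_enc_len(const char *src, size_t srclen)
    {
        return (uint64) srclen << 1;    /* two hex digits per input byte */
    }

    uint64
    hex_dec_len(const char *src, size_t srclen)
    {
        return (uint64) srclen >> 1;    /* one output byte per two hex digits */
    }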
On Fri, Dec 31, 2021 at 9:32 AM Hans Buschmann <buschmann@nidsa.net> wrote:
> Inspired by the effort to integrate JIT for executor acceleration I thought selected simple algorithms working with array-oriented data should be drastically accelerated by using SIMD instructions on modern hardware.
Hi Hans,
I have experimented with SIMD within Postgres last year, so I have
some idea of the benefits and difficulties. I do think we can profit
from SIMD more, but we must be very careful to manage complexity and
maximize usefulness. Hopefully I can offer some advice.
> - restrict on 64 -bit architectures
> These are the dominant server architectures, have the necessary data formats and corresponding registers and operating instructions
> - start with Intel x86-64 SIMD instructions:
> This is the vastly most used platform, available for development and in practical use
> - don’t restrict the concept to only Intel x86-64, so that later people with more experience on other architectures can jump in and implement comparable algorithms
> - fallback to the established implementation in postgres in non appropriate cases or on user request (GUC)
These are all reasonable goals, except GUCs are the wrong place to
choose hardware implementations -- it should Just Work.
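For example, the implementation could be picked once at startup from what the CPU
reports, along these lines (a rough sketch only; the helper names are made up for
illustration):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t (*hex_encode_fn) (const char *src, size_t len, char *dst);

    /* hypothetical implementations */
    extern uint64_t hex_encode_scalar(const char *src, size_t len, char *dst);
    extern uint64_t hex_encode_avx512(const char *src, size_t len, char *dst);

    static hex_encode_fn hex_encode_impl = hex_encode_scalar;

    static void
    choose_hex_encode_impl(void)
    {
    #if defined(__x86_64__) && defined(__GNUC__)
        /* GCC/Clang builtin CPU feature test; no GUC involved */
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx512bw"))
            hex_encode_impl = hex_encode_avx512;
    #endif
    }

That way users simply get the fastest version their hardware supports.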
> - coding for maximum hardware usage instead of elegant programming
> Once tested, the simple algorithm works as advertised and is used to replace most execution parts of the standard implementaion in C
-1
Maintaining good programming style is a key goal of the project. There
are certainly non-elegant parts in the code, but that has a cost and
we must consider tradeoffs carefully. I have read some of the
optimized code in glibc and it is not fun. They at least know they are
targeting one OS and one compiler -- we don't have that luxury.
> - focus optimization for the most advanced SIMD instruction set: AVX512
> This provides the most advanced instructions and quite a lot of large registers to aid in latency avoiding
-1
AVX512 is a hodge-podge of different instruction subsets and are
entirely lacking on some recent Intel server hardware. Also only
available from a single chipmaker thus far.
> - The loops implementing the algorithm are written in NASM assembler:
> NASM is actively maintained, has many output formats, follows the Intel style, has all current instrucions implemented and is fast.
> - The loops are mostly independent of operating systems, so all OS’s basing on a NASM obj output format are supported:
> This includes Linux and Windows as the most important ones
> - The algorithms use advanced techniques (constant and temporary registers) to avoid most unnessary memory accesses:
> The assembly implementation gives you the full control over the registers (unlike intrinsics)
On the other hand, intrinsics are easy to integrate into a C codebase
and relieve us from thinking about object formats. A performance
feature that happens to work only on common OS's is probably fine from
the user point of view, but if we have to add a lot of extra stuff to
make it work at all, that's not a good trade off. "Mostly independent"
of the OS is not acceptable -- we shouldn't have to think about the OS
at all when the coding does not involve OS facilities (I/O, processes,
etc).
> As an example I think of pg_dump to dump a huge amount of bytea data (not uncommon in real applications). Most of these data are in toast tables, often uncompressed due to their inherent structure. The dump must read the toast pages into memory, decompose the page, hexdump the content, put the result in an output buffer and trigger the I/O. By integrating all these steps into one, big performance improvements can be achieved (but naturally not here in my first implementation!).
Seems like a reasonable area to work on, but I've never measured.
> The best result I could achieve was roughly 95 seconds for 1 Million dumps of 1718 KB on a Tigerlake laptop using AVX512. This gives about 18 GB/s source-hexdumping rate on a single core!
> In another run with postgres, hexdumping about half a million tuples with a bytea column yielding about 6 GB of output reduced the time from about 68 seconds to 60 seconds, which clearly shows the postgres overhead for executing the copy command on such a data set.
I don't quite follow -- is this patched vs. unpatched Postgres? I'm
not sure what's been demonstrated.
> The assembler routines should work on most x86-64 operating systems, but for the moment only elf64 and WIN64 output formats are supported.
> The standard calling convention is followed mostly in the LINUX style; on Windows the parameters are moved around accordingly. The same assembler-source-code can be used on both platforms.
> I have updated the makefile to include the nasm command and the nasm flags, but I need help to make these based on configure.
> I also have no knowledge on other operating systems (MAC-OS etc.)
> The calling conventions can be easily adapted if they differ but somebody else should jump in for testing.
As I implied earlier, this is way too low-level. If we have to worry
about obj formats and calling conventions, we'd better be getting
something *really* amazing in return.
> But I really need help from an expert to integrate it in the perl building process.
> I would much appreciate it if someone else could jump in for a patch to configure-integration and another patch for .vcxobj integration.
It's a bit presumptuous to enlist others for specific help without
general agreement on the design, especially on the most tedious parts.
Also, here's a general engineering tip: If the non-fun part is too
complex for you to figure out, that might indicate the fun part is too
ambitious. I suggest starting with a simple patch with SSE2 (always
present on x86-64) intrinsics, one that anyone can apply and test
without any additional work. Then we can evaluate if the speed-up in
the hex encoding case is worth some additional complexity. As part of
that work, it might be good to see if some portable improved algorithm
is already available somewhere.
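To make that concrete, something like the following is the kind of simple starting point
I have in mind (an untested sketch, SSE2 only -- no PSHUFB, so the nibble-to-digit step is
done arithmetically -- mirroring the signature and lowercase digits of the existing
hex_encode()):

    #include <emmintrin.h>    /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    /* map one nibble (0..15) per byte to its ASCII hex digit */
    static inline __m128i
    nibbles_to_ascii(__m128i nib)
    {
        const __m128i nine = _mm_set1_epi8(9);
        const __m128i zero_char = _mm_set1_epi8('0');
        const __m128i gap = _mm_set1_epi8('a' - 10 - '0');   /* 39 */
        __m128i is_alpha = _mm_cmpgt_epi8(nib, nine);         /* 0xFF where nibble > 9 */
        __m128i adj = _mm_and_si128(is_alpha, gap);
        return _mm_add_epi8(_mm_add_epi8(nib, zero_char), adj);
    }

    uint64_t
    hex_encode_sse2(const char *src, size_t len, char *dst)
    {
        const __m128i lo_mask = _mm_set1_epi8(0x0F);
        size_t i = 0;

        for (; i + 16 <= len; i += 16)
        {
            __m128i in = _mm_loadu_si128((const __m128i *) (src + i));
            __m128i hi = _mm_and_si128(_mm_srli_epi64(in, 4), lo_mask);
            __m128i lo = _mm_and_si128(in, lo_mask);
            __m128i hi_d = nibbles_to_ascii(hi);
            __m128i lo_d = nibbles_to_ascii(lo);

            /* interleave so the high digit of each byte is written first */
            _mm_storeu_si128((__m128i *) (dst + 2 * i), _mm_unpacklo_epi8(hi_d, lo_d));
            _mm_storeu_si128((__m128i *) (dst + 2 * i + 16), _mm_unpackhi_epi8(hi_d, lo_d));
        }

        /* scalar tail, same digit table as the existing implementation */
        for (; i < len; i++)
        {
            static const char hextbl[] = "0123456789abcdef";
            unsigned char c = (unsigned char) src[i];

            dst[2 * i] = hextbl[c >> 4];
            dst[2 * i + 1] = hextbl[c & 0x0F];
        }
        return (uint64_t) len * 2;    /* output bytes written */
    }

The same structure extends naturally to SSSE3 (PSHUFB for the lookup) or AVX2 without
touching the callers.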
> There is much room for other implementations (checksum verification/setting, aggregation, numeric datatype, merging, generate_series, integer and floating point output …) which could be addressed later on.
Float output has already been pretty well optimized. CRC checksums
already have a hardware implementation on x86 and Arm. I don't know of
any practical workload where generate_series() is too slow.
Aggregation is an interesting case, but I'm not sure what the current
bottlenecks are.
--
John Naylor
EDB: http://www.enterprisedb.com