Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Hi.
In the functions *systable_beginscan* and *systable_beginscan_ordered*,
a small optimization is possible.
The array *idxkey* can be constructed in one go with a single call to
memcpy.
The gain might not amount to much, but I think it's worth the effort.
Patch attached.
best regards,
Ranier Vilela
Attachments:
avoid-multiple-calls-to-memcpy-genam.patch (application/octet-stream, +4 -4)
On Mon, Mar 9, 2026 at 10:16 AM, Ranier Vilela <ranier.vf@gmail.com>
wrote:
> In the functions *systable_beginscan* and *systable_beginscan_ordered*,
> a small optimization is possible. The array *idxkey* can be constructed
> in one go with a single call to memcpy. Patch attached.
Someone asked me whether -O2 already does this work.
Apparently not.
https://godbolt.org/z/h5dndz33x
best regards,
Ranier Vilela
I created an example that is a little closer to the actual code and
changed the compiler from C++ to C.
It is interesting which optimization the compiler chose for version 1
versus version 2: one calls memcpy and one doesn't. There is a good
chance that inlining memcpy as SSE+scalar per iteration will be faster
for syscache scans, which I believe are usually small (1-4 keys?).
Probably the only reason to do this patch would be if N is normally
large, or if it is considered an improvement in code clarity without a
detrimental impact on small-N syscache scans.
I realize you only said "possible small optimization". It might be
worthwhile to benchmark the code for different values of n to determine
whether there is a tipping point either way.
https://godbolt.org/z/dM18cGfE6
-- bg
On Mon, Mar 9, 2026 at 11:47 AM, Bryan Green <dbryan.green@gmail.com>
wrote:
> I created an example that is a little closer to the actual code and
> changed the compiler from C++ to C. It is interesting which
> optimization the compiler chose for version 1 versus version 2: one
> calls memcpy and one doesn't. There is a good chance that inlining
> memcpy as SSE+scalar per iteration will be faster for syscache scans,
> which I believe are usually small (1-4 keys?).

I doubt the inline version is better.
Clang is supported too, and the code it generates is much better with a
single memcpy call outside of the loop.

> Probably the only reason to do this patch would be if N is normally
> large, or if it is considered an improvement in code clarity without a
> detrimental impact on small-N syscache scans.
> I realize you only said "possible small optimization". It might be
> worthwhile to benchmark the code for different values of n to determine
> whether there is a tipping point either way.

In your opinion, shouldn't this be considered an optimization, even a
small one?

best regards,
Ranier Vilela
I performed a micro-benchmark on my dual EPYC (Zen 2) server, and
version 1 wins for small values of n.
20 runs:
n version min median mean max stddev noise%
-----------------------------------------------------------------------
n=1 version1 2.440 2.440 2.450 2.550 0.024 4.5%
n=1 version2 4.260 4.280 4.277 4.290 0.007 0.7%
n=2 version1 2.740 2.750 2.757 2.880 0.029 5.1%
n=2 version2 3.970 3.980 3.980 4.020 0.010 1.3%
n=4 version1 4.580 4.595 4.649 4.910 0.094 7.2%
n=4 version2 5.780 5.815 5.809 5.820 0.013 0.7%
But micro-benchmarks always make me nervous, so I looked at the actual
instruction cost for my platform given the version 1 and version 2 code.
If we count CPU cycles using the AMD Zen 2 instruction
latency/throughput tables, version 1's loop body has a critical path of
~5-6 cycles per iteration, while version 2's loop body has ~3-4 cycles
per iteration.
The problem for version 2 is that the call to memcpy costs ~24-30 cycles
due to the stub, the call and return, and branch-predictor pressure on
the first call. This probably results in a ~2.5 ns per-iteration cost
for version 2.
So no, I wouldn't call it an optimization. But it will be interesting to
hear other opinions on this.
--bg
I meant to add that the micro-benchmark ran each loop for 100,000,000
iterations per run, over 20 runs, to come up with those numbers.
--bg