Optimize numeric multiplication for one and two base-NBASE digit multiplicands.
Hello hackers,
The attached patch introduces an optimization of mul_var() in numeric.c, targeting cases where the multiplicands consist of only one or two base-NBASE digits. Such small multiplicands can fit into an int64 and thus be computed directly, resulting in a significant performance improvement, between 26% and 34% as benchmarked on an Intel Core i9-14900K.
This optimization is similar to commit d1b307eef2, which also targeted one and two base-NBASE digit operands, but optimized div_var().
Regards,
Joel
Attachments:
0001-optimize-numeric-mul_var-small-factors.patch
On Mon, Jul 1, 2024, at 08:04, Joel Jacobson wrote:
* 0001-optimize-numeric-mul_var-small-factors.patch
New version to silence maybe-uninitialized error reported by cfbot.
/Joel
Attachments:
v2-0001-optimize-numeric-mul_var-small-factors.patch
"Joel Jacobson" <joel@compiler.org> writes:
Hello hackers,
Attached patch introduces an optimization of mul_var() in numeric.c,
targeting cases where the multiplicands consist of only one or two
base-NBASE digits. Such small multiplicands can fit into an int64 and
thus be computed directly, resulting in a significant performance
improvement, between 26% - 34% benchmarked on Intel Core i9-14900K.
This optimization is similar to commit d1b307eef2, that also targeted
one and two base-NBASE digit operands, but optimized div_var().
div_var() also has an optimisation for 3- and 4-digit operands under
HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
mul_var() too?
Regards,
Joel
- ilmari
On Mon, Jul 1, 2024, at 14:25, Dagfinn Ilmari Mannsåker wrote:
div_var() also has an optimisation for 3- and 4-digit operands under
HAVE_INT128 (added in commit 0aa38db56bf), would that make sense in
mul_var() too?
I considered it, but it only gives a marginal speed-up on Intel Core i9-14900K,
and is actually slower on Apple M3 Max.
Not really sure why. Maybe the code I tried can be optimized further:
```
#ifdef HAVE_INT128
	/*
	 * If var1 and var2 are up to four digits each, their product will fit
	 * in an int128 and can be computed directly, which is significantly
	 * faster.  (mul_var() arranges for var1 to be the shorter input, so
	 * var1ndigits <= var2ndigits here.)
	 */
	if (var2ndigits <= 4)
	{
		int128		product = 0;

		switch (var1ndigits)
		{
			case 1:
				product = var1digits[0];
				break;
			case 2:
				product = var1digits[0] * NBASE + var1digits[1];
				break;
			case 3:
				product = ((int128) var1digits[0] * NBASE + var1digits[1])
					* NBASE + var1digits[2];
				break;
			case 4:
				product = (((int128) var1digits[0] * NBASE + var1digits[1])
						   * NBASE + var1digits[2]) * NBASE + var1digits[3];
				break;
		}

		switch (var2ndigits)
		{
			case 1:
				product *= var2digits[0];
				break;
			case 2:
				product *= var2digits[0] * NBASE + var2digits[1];
				break;
			case 3:
				product *= ((int128) var2digits[0] * NBASE + var2digits[1])
					* NBASE + var2digits[2];
				break;
			case 4:
				product *= (((int128) var2digits[0] * NBASE + var2digits[1])
							* NBASE + var2digits[2]) * NBASE + var2digits[3];
				break;
		}

		alloc_var(result, res_ndigits);
		res_digits = result->digits;
		for (i = res_ndigits - 1; i >= 0; i--)
		{
			res_digits[i] = product % NBASE;
			product /= NBASE;
		}
		Assert(product == 0);

		/*
		 * Finally, round the result to the requested precision.
		 */
		result->weight = res_weight;
		result->sign = res_sign;

		/* Round to target rscale (and set result->dscale) */
		round_var(result, rscale);

		/* Strip leading and trailing zeroes */
		strip_var(result);

		return;
	}
#endif
```
Benchmark 1, testing 2 ndigits * 2 ndigits:
SELECT
timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
'numeric_mul',
'numeric_mul_patched',
input_values := ARRAY[
'11112222',
'33334444'
],
min_time := 1000000,
timeout := '10 s'
);
Apple M3 Max:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
32.2 ns | 20.5 ns | 1.5700112246809388
(1 row)
Intel Core i9-14900K:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
30.2 ns | 21.4 ns | 1.4113042510107371
(1 row)
So 57% and 41% faster.
Benchmark 2, testing 4 ndigits * 4 ndigits:
SELECT
timeit.pretty_time(total_time_a / 1e6 / executions,3) AS execution_time_a,
timeit.pretty_time(total_time_b / 1e6 / executions,3) AS execution_time_b,
total_time_a::numeric/total_time_b AS performance_ratio
FROM timeit.cmp(
'numeric_mul',
'numeric_mul_patched',
input_values := ARRAY[
'1111222233334444',
'5555666677778888'
],
min_time := 1000000,
timeout := '10 s'
);
Apple M3 Max:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+------------------------
41.9 ns | 51.3 ns | 0.81733655797170943614
(1 row)
Intel Core i9-14900K:
execution_time_a | execution_time_b | performance_ratio
------------------+------------------+--------------------
40 ns | 38 ns | 1.0515610914706320
(1 row)
So 18% slower on Apple M3 Max and just 5% faster on Intel Core i9-14900K.
/Joel
On Mon, Jul 1, 2024, at 15:11, Joel Jacobson wrote:
Not really sure why. Maybe the code I tried can be optimized further:
If anyone wants to experiment with the int128 version,
here is a patch that adds a separate numeric_mul_patched() function,
so it's easier to benchmark against the unmodified numeric_mul().
/Joel
Attachments:
0001-Optimize-mul_var-for-var2ndigits-4.patch
On Mon, Jul 1, 2024, at 15:14, Joel Jacobson wrote:
* 0001-Optimize-mul_var-for-var2ndigits-4.patch
Found a typo, fixed in new version.
The int128 version is still slower though,
I wonder if there is something that can be done to speed it up further.
Below is a more realistic benchmark than just microbenchmarking mul_var().
It doesn't exercise the int128 version, only the code path for up to 2 NBASE digits:
```
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
SELECT SUM(num1*num2) FROM bench_mul_var;
Time: 8331.953 ms (00:08.332)
Time: 7415.241 ms (00:07.415)
Time: 7298.296 ms (00:07.298)
Time: 7314.754 ms (00:07.315)
Time: 7289.560 ms (00:07.290)
SELECT SUM(numeric_mul_patched(num1,num2)) FROM bench_mul_var;
Time: 6403.426 ms (00:06.403)
Time: 6401.797 ms (00:06.402)
Time: 6366.136 ms (00:06.366)
Time: 6376.049 ms (00:06.376)
Time: 6317.282 ms (00:06.317)
```
Benchmarked on an Intel Core i9-14900K machine.
/Joel
Attachments:
v2-0001-Optimize-mul_var-for-var2ndigits-4.patch
On Mon, 1 Jul 2024 at 20:56, Joel Jacobson <joel@compiler.org> wrote:
Below is a more realistic benchmark
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
SELECT SUM(num1*num2) FROM bench_mul_var;
I had a play with this, and came up with a slightly different way of
doing it that works for var2 of any size, as long as var1 is just 1 or
2 digits.
Repeating your benchmark where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 4762.990 ms (00:04.763)
Time: 4332.166 ms (00:04.332)
Time: 4276.211 ms (00:04.276)
Time: 4247.321 ms (00:04.247)
Time: 4166.738 ms (00:04.167)
SELECT SUM(num1*num2) FROM bench_mul_var; -- v2 patch
Time: 4398.812 ms (00:04.399)
Time: 3672.668 ms (00:03.673)
Time: 3650.227 ms (00:03.650)
Time: 3611.420 ms (00:03.611)
Time: 3534.218 ms (00:03.534)
SELECT SUM(num1*num2) FROM bench_mul_var; -- this patch
Time: 3350.596 ms (00:03.351)
Time: 3336.224 ms (00:03.336)
Time: 3335.599 ms (00:03.336)
Time: 3336.990 ms (00:03.337)
Time: 3351.453 ms (00:03.351)
(This was on an older Intel Core i9-9900K, so I'm not sure why all the
timings are faster. What compiler settings are you using?)
The approach taken in this patch only uses 32-bit integers, so in
theory it could be extended to work for var1ndigits = 3, 4, or even
more, but the code would get increasingly complex, and I ran out of
steam at 2 digits. It might be worth trying though.
Regards,
Dean
Attachments:
optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
I had a play with this, and came up with a slightly different way of
doing it that works for var2 of any size, as long as var1 is just 1 or
2 digits.
Repeating your benchmark where both numbers have up to 2 NBASE-digits,
this new approach was slightly faster:
...
(This was on an older Intel Core i9-9900K, so I'm not sure why all the
timings are faster. What compiler settings are you using?)
Strange. I just did `./configure` with a --prefix.
Compiler settings on my Intel Core i9-14900K machine:
$ pg_config | grep -E '^(CC|CFLAGS|CPPFLAGS|LDFLAGS)'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/home/joel/pg-dev/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
The approach taken in this patch only uses 32-bit integers, so in
theory it could be extended to work for var1ndigits = 3, 4, or even
more, but the code would get increasingly complex, and I ran out of
steam at 2 digits. It might be worth trying though.
Regards,
Dean
Attachments:
* optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Really nice!
I've benchmarked your patch on my three machines with great results.
I added a setseed() step to make the benchmarks reproducible.
It shouldn't matter much, since differences should statistically average out, but I thought why not.
CREATE TABLE bench_mul_var (num1 numeric, num2 numeric);
SELECT setseed(0.12345);
INSERT INTO bench_mul_var (num1, num2)
SELECT random(0::numeric,1e8::numeric), random(0::numeric,1e8::numeric) FROM generate_series(1,1e8);
\timing
/*
* Apple M3 Max
*/
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 3622.342 ms (00:03.622)
Time: 3029.786 ms (00:03.030)
Time: 3046.497 ms (00:03.046)
Time: 3035.910 ms (00:03.036)
Time: 3034.073 ms (00:03.034)
SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2484.685 ms (00:02.485)
Time: 2478.341 ms (00:02.478)
Time: 2494.397 ms (00:02.494)
Time: 2470.987 ms (00:02.471)
Time: 2490.215 ms (00:02.490)
/*
* Intel Core i9-14900K
*/
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 2555.569 ms (00:02.556)
Time: 2523.145 ms (00:02.523)
Time: 2518.671 ms (00:02.519)
Time: 2514.501 ms (00:02.515)
Time: 2516.919 ms (00:02.517)
SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2246.441 ms (00:02.246)
Time: 2243.900 ms (00:02.244)
Time: 2245.350 ms (00:02.245)
Time: 2245.080 ms (00:02.245)
Time: 2247.856 ms (00:02.248)
/*
* AMD Ryzen 9 7950X3D
*/
SELECT SUM(num1*num2) FROM bench_mul_var; -- HEAD
Time: 3037.497 ms (00:03.037)
Time: 3010.037 ms (00:03.010)
Time: 3000.956 ms (00:03.001)
Time: 2989.424 ms (00:02.989)
Time: 2984.911 ms (00:02.985)
SELECT SUM(num1*num2) FROM bench_mul_var; -- optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Time: 2645.530 ms (00:02.646)
Time: 2640.472 ms (00:02.640)
Time: 2638.613 ms (00:02.639)
Time: 2637.889 ms (00:02.638)
Time: 2638.054 ms (00:02.638)
/Joel
On Tue, 2 Jul 2024 at 08:50, Joel Jacobson <joel@compiler.org> wrote:
On Tue, Jul 2, 2024, at 00:19, Dean Rasheed wrote:
Attachments:
* optimize-numeric-mul_var-small-var1-arbitrary-var2.patch.txt
Shortly after posting that, I realised that there was a small bug. This bit:
case 2:
    newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
if (res_ndigits - 3 < var2ndigits)
    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
That doesn't noticeably affect the performance though. Update attached.
I've benchmarked your patch on my three machines with great results.
I added a setseed() step, to make the benchmarks reproducible,
shouldn't matter much since it should statistically average out, but I thought why not.
Nice. The results on the Apple M3 Max look particularly impressive.
I think it'd probably be worth trying to extend this to 3 or maybe 4
var1 digits, since that would cover a lot of "everyday" sized numeric
values that a lot of people might be using. I don't think we should go
beyond that though.
Regards,
Dean
Attachments:
v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
On Tue, Jul 2, 2024, at 10:22, Dean Rasheed wrote:
Shortly after posting that, I realised that there was a small bug. This bit:
case 2:
    newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
if (res_ndigits - 3 < var2ndigits)
    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
That doesn't noticeably affect the performance though. Update attached.
Nice catch. Could we add a test that exercises mul_var() with rscale
less than the full result, to catch bugs like this one and others?
I created a new benchmark, that specifically tests different var1ndigits.
I've only run it on the Apple M3 Max so far. More to come.
\timing
SELECT setseed(0.12345);
CREATE TABLE bench_mul_var_var1ndigits_1 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_1 (var1, var2)
SELECT random(0::numeric,9999::numeric), random(10000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_2 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_2 (var1, var2)
SELECT random(10000000::numeric,99999999::numeric), random(100000000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_3 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_3 (var1, var2)
SELECT random(100000000000::numeric,999999999999::numeric), random(1000000000000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
CREATE TABLE bench_mul_var_var1ndigits_4 (var1 numeric, var2 numeric);
INSERT INTO bench_mul_var_var1ndigits_4 (var1, var2)
SELECT random(1000000000000000::numeric,9999999999999999::numeric), random(10000000000000000000::numeric,1e32::numeric) FROM generate_series(1,1e8);
/*
* Apple M3 Max
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- HEAD
Time: 2986.952 ms (00:02.987)
Time: 2991.765 ms (00:02.992)
Time: 2987.253 ms (00:02.987)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_1; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 2874.841 ms (00:02.875)
Time: 2883.070 ms (00:02.883)
Time: 2899.973 ms (00:02.900)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- HEAD
Time: 3459.556 ms (00:03.460)
Time: 3304.983 ms (00:03.305)
Time: 3299.728 ms (00:03.300)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_2; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3053.140 ms (00:03.053)
Time: 3065.227 ms (00:03.065)
Time: 3069.995 ms (00:03.070)
/*
* Just for completeness, also testing var1ndigits 3 and 4,
* although no change is expected since not yet implemented.
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3809.005 ms (00:03.809)
Time: 3438.260 ms (00:03.438)
Time: 3453.920 ms (00:03.454)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3437.592 ms (00:03.438)
Time: 3457.586 ms (00:03.458)
Time: 3474.344 ms (00:03.474)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- HEAD
Time: 4133.193 ms (00:04.133)
Time: 3554.477 ms (00:03.554)
Time: 3560.855 ms (00:03.561)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_4; -- v2-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3508.540 ms (00:03.509)
Time: 3566.721 ms (00:03.567)
Time: 3524.083 ms (00:03.524)
I think it'd probably be worth trying to extend this to 3 or maybe 4
var1 digits, since that would cover a lot of "everyday" sized numeric
values that a lot of people might be using. I don't think we should go
beyond that though.
I think so too. I'm working on var1ndigits=3 now.
/Joel
On Tue, Jul 2, 2024, at 11:05, Joel Jacobson wrote:
On Tue, Jul 2, 2024, at 10:22, Dean Rasheed wrote:
Shortly after posting that, I realised that there was a small bug. This bit:
case 2:
    newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
isn't quite right in the case where rscale is less than the full
result. In that case, the least significant result digit has a
contribution from one more var2 digit, so it needs to be:
newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
if (res_ndigits - 3 < var2ndigits)
    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
Just to be able to verify mul_var() is working as expected when
rscale is less than the full result, I've added a numeric_mul_patched()
function that takes a third rscale_adjustment parameter:
I've tried to get a different result with and without the fix,
but I'm failing to hit the bug.
Can you think of an example that should trigger the bug?
Example usage:
SELECT numeric_mul_patched(9999.999,2,0);
var1ndigits=1 var2ndigits=2 rscale=3
numeric_mul_patched
---------------------
19999.998
(1 row)
SELECT numeric_mul_patched(9999.999,2,1);
var1ndigits=1 var2ndigits=2 rscale=4
numeric_mul_patched
---------------------
19999.9980
(1 row)
SELECT numeric_mul_patched(9999.999,2,-1);
var1ndigits=1 var2ndigits=2 rscale=2
numeric_mul_patched
---------------------
20000.00
(1 row)
SELECT numeric_mul_patched(9999.999,2,-2);
var1ndigits=1 var2ndigits=2 rscale=1
numeric_mul_patched
---------------------
20000.0
(1 row)
/Joel
On Tue, 2 Jul 2024 at 11:23, Joel Jacobson <joel@compiler.org> wrote:
Just to be able to verify mul_var() is working as expected when
rscale is less than the full result, I've added a numeric_mul_patched()
function that takes a third rscale_adjustment parameter:
Yeah, we could expose such a function, and maybe it would be generally
useful, not just for testing, but that's really a separate proposal.
In general, mul_var() with reduced rscale doesn't guarantee correctly
rounded results though, which might make it less generally acceptable.
For this patch though, the aim is just to ensure the results are the
same as before.
I've tried to get a different result with and without the fix,
but I'm failing to hit the bug.
Can you think of an example that should trigger the bug?
9999.0001 * 5000.9999_9999 with rscale = 0 triggers it (returned
50004999 instead of 50005000).
Regards,
Dean
On Tue, Jul 2, 2024, at 13:44, Dean Rasheed wrote:
Can you think of an example that should trigger the bug?
9999.0001 * 5000.9999_9999 with rscale = 0 triggers it (returned
50004999 instead of 50005000).
Thanks, helpful.
Attached patch adds the var1ndigits=3 case.
Benchmark:
/*
* Apple M3 Max
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3809.005 ms (00:03.809)
Time: 3438.260 ms (00:03.438)
Time: 3453.920 ms (00:03.454)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3230.017 ms (00:03.230)
Time: 3244.094 ms (00:03.244)
Time: 3246.370 ms (00:03.246)
/*
* Intel Core i9-14900K
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 4082.759 ms (00:04.083)
Time: 4077.372 ms (00:04.077)
Time: 4080.830 ms (00:04.081)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3989.021 ms (00:03.989)
Time: 3978.263 ms (00:03.978)
Time: 3955.331 ms (00:03.955)
/*
* AMD Ryzen 9 7950X3D
*/
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- HEAD
Time: 3791.271 ms (00:03.791)
Time: 3760.138 ms (00:03.760)
Time: 3758.773 ms (00:03.759)
SELECT SUM(var1*var2) FROM bench_mul_var_var1ndigits_3; -- v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Time: 3400.824 ms (00:03.401)
Time: 3373.714 ms (00:03.374)
Time: 3345.505 ms (00:03.346)
Regards,
Joel
Attachments:
v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
On Tue, Jul 2, 2024, at 18:20, Joel Jacobson wrote:
* v3-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Hmm, v3 contains a bug which I haven't been able to solve yet.
Reporting now to avoid wasting anyone's time reviewing it, since it's buggy.
The attached patch is how I tested and found the bug.
It contains a file test-mul-var.sql, which tests mul_var()
with varying rscale, using the SQL-callable numeric_mul_patched(),
whose third argument is the rscale_adjustment.
Out of 2481600 random tests, all passed except 4:
SELECT
result = numeric_mul_patched(var1,var2,rscale_adjustment),
COUNT(*)
FROM test_numeric_mul_patched
GROUP BY 1
ORDER BY 1;
?column? | count
----------+---------
f | 4
t | 2481596
(2 rows)
SELECT
var1,
var2,
var1*var2 AS full_resolution,
rscale_adjustment,
result AS expected,
numeric_mul_patched(var1,var2,rscale_adjustment) AS actual
FROM test_numeric_mul_patched
WHERE result <> numeric_mul_patched(var1,var2,rscale_adjustment);
var1 | var2 | full_resolution | rscale_adjustment | expected | actual
-------------------+----------------+---------------------------+-------------------+----------+--------
676.797214075 | 0.308068877759 | 208.500158210502929257925 | -21 | 209 | 208
555.07029 | 0.381033094735 | 211.50015039415392315 | -17 | 212 | 211
0.476637718921 | 66.088276 | 31.500165120061470196 | -18 | 32 | 31
0.060569165063082 | 998.85933 | 60.50007563356949425506 | -20 | 61 | 60
(4 rows)
Trying to wrap my head around what could cause this.
It's rounding down instead of up, and these cases all end with decimal .500XXXX.
Regards,
Joel
Attachments:
numeric_mul_patched.txt
On Tue, Jul 2, 2024, at 20:53, Joel Jacobson wrote:
Trying to wrap my head around what could cause this.
It's rounding down instead of up, and these cases all end with decimal .500XXXX.
Interesting, I actually think there is a bug in the normal mul_var() code.
Found a case that rounds down, when it should round up:
Calling mul_var() with:
var1=51.2945442386599
var2=0.828548712212
rscale=0
returns 42, but I think it should return 43,
since 51.2945442386599*0.828548712212=42.5000285724431241296446988
But maybe this is expected and OK, having to do with MUL_GUARD_DIGITS?
/Joel
On Tue, Jul 2, 2024, at 21:55, Joel Jacobson wrote:
On Tue, Jul 2, 2024, at 20:53, Joel Jacobson wrote:
Trying to wrap my head around what could cause this.
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:
case 2:
    newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
The problem here is that res_ndigits could become less than 4,
if rscale is low enough,
and then we would try to access a negative array index in var2digits.
Fix:
case 2:
    if (res_ndigits - 4 >= 0 && res_ndigits - 4 < var2ndigits)
        newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
    else
        newdig = 0;
Another problem in the case 2 code:
if (res_ndigits - 3 < var2ndigits)
    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
Fix:
if (res_ndigits - 3 >= 0 && res_ndigits - 3 < var2ndigits)
    newdig += (int) var1digits[0] * var2digits[res_ndigits - 3];
New version attached that fixes both the case 2 and case 3 code.
Regards,
Joel
Attachments:
v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
On Tue, Jul 2, 2024, at 22:10, Joel Jacobson wrote:
* v4-optimize-numeric-mul_var-small-var1-arbitrary-var2.patch
Instead of these boundary checks, maybe it would be cleaner to just
skip this optimization if there are too few res_ndigits?
/Joel
On Tue, 2 Jul 2024 at 20:55, Joel Jacobson <joel@compiler.org> wrote:
Interesting, I actually think there is a bug in the normal mul_var() code.
Found a case that rounds down, when it should round up:
Calling mul_var() with:
var1=51.2945442386599
var2=0.828548712212
rscale=0
returns 42, but I think it should return 43,
since 51.2945442386599*0.828548712212=42.5000285724431241296446988
But maybe this is expected and OK, having to do with MUL_GUARD_DIGITS?
No, that's not a bug. It's to be expected. The point of using only
MUL_GUARD_DIGITS extra digits is to get the correctly rounded result
*most of the time*, while saving a lot of effort by not computing the
full result.
The callers of mul_var() that ask for reduced rscale results are the
transcendental functions like ln_var() and exp_var(), which are
working to some limited precision intended to compute the final result
reasonably accurately to a particular rscale.
Regards,
Dean
On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:
case 2:
    newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
The problem here is that res_ndigits could become less than 4,
Yes. It can't be less than 3 though (per an earlier test), so the case
2 code was correct.
I've been hacking on this a bit and trying to tidy it up. Firstly, I
moved it to a separate function, because it was starting to look messy
having so much extra code in mul_var(). Then I added a bunch more
comments to explain what's going on, and the limits of the various
variables. Note that most of the boundary checks are actually
unnecessary -- in particular all the ones in or after the main loop,
provided you pull out the first 2 result digits from the main loop in
the 3-digit case. That does seem to work very well, but...
I wasn't entirely happy with how messy that code is getting, so I
tried a different approach. Similar to div_var_int(), I tried writing
a mul_var_int() function instead. This can be used for 1 and 2 digit
factors, and we could add a similar mul_var_int64() function on
platforms with 128-bit integers. The code looks quite a lot neater, so
it's probably less likely to contain bugs (though I have just written
it in a hurry, so it might still have bugs). In testing, it seemed to
give a decent speedup, but perhaps a little less than before. But
that's to be balanced against having more maintainable code, and also
a function that might be useful elsewhere in numeric.c.
Anyway, here are both patches for comparison. I'll stop hacking for a
while and let you see what you make of these.
Regards,
Dean
On Wed, 3 Jul 2024 at 08:18, Dean Rasheed <dean.a.rasheed@gmail.com>
wrote:
On Tue, 2 Jul 2024 at 21:10, Joel Jacobson <joel@compiler.org> wrote:
I found the bug in the case 3 code,
and it turns out the same type of bug also exists in the case 2 code:
case 2:
    newdig = (int) var1digits[1] * var2digits[res_ndigits - 4];
The problem here is that res_ndigits could become less than 4,
Yes. It can't be less than 3 though (per an earlier test), so the case
2 code was correct.
I've been hacking on this a bit and trying to tidy it up. Firstly, I
moved it to a separate function, because it was starting to look messy
having so much extra code in mul_var(). Then I added a bunch more
comments to explain what's going on, and the limits of the various
variables. Note that most of the boundary checks are actually
unnecessary -- in particular all the ones in or after the main loop,
provided you pull out the first 2 result digits from the main loop in
the 3-digit case. That does seem to work very well, but...
I wasn't entirely happy with how messy that code is getting, so I
tried a different approach. Similar to div_var_int(), I tried writing
a mul_var_int() function instead. This can be used for 1 and 2 digit
factors, and we could add a similar mul_var_int64() function on
platforms with 128-bit integers. The code looks quite a lot neater, so
it's probably less likely to contain bugs (though I have just written
it in a hurry, so it might still have bugs). In testing, it seemed to
give a decent speedup, but perhaps a little less than before. But
that's to be balanced against having more maintainable code, and also
a function that might be useful elsewhere in numeric.c.
Anyway, here are both patches for comparison. I'll stop hacking for a
while and let you see what you make of these.
I liked v5-add-mul_var_int.patch better.
I think that *var_digits can be const too.
+ const NumericDigit *var_digits = var->digits;
Typo in the comments:
- by procssing
+ by processing
best regards,
Ranier Vilela