Some optimisations for numeric division
Attached are 3 small patches that improve the performance of numeric
division. Patch 0002 seems to have the biggest impact, but I think
they're all worth including, since they're quite simple changes, with
noticeable performance gains.
Patch 0001 vectorises the inner loop of div_var_fast(). This loop is
almost identical to the inner loop of mul_var(), which was vectorised
by commit 8870917623. The thing preventing the div_var_fast() loop
from being vectorised is the assignment to div[qi + i], and simply
replacing that with div_qi[i] where div_qi = &div[qi], in the same
style as mul_var(), fixes that.
One difference between this and the mul_var() code is that it is also
necessary to add an explicit cast to get the compiler to generate
matching output, and give the best results. This is because in
mul_var() the compiler recognises that var1digit is actually 16-bit,
rather than 32-bit, and so it doesn't need to multiply the high part,
but in div_var_fast() it's not obvious to the compiler that qdigit
also fits in 16 bits, hence the cast.
The actual performance gain is typically quite modest, since
div_var_fast() is always only a part of some more complex numeric
computation, but it can be seen in cases like raising numbers to
negative integer powers, e.g.:
CREATE TEMP TABLE num_test(x numeric);
SELECT setseed(0);
INSERT INTO num_test
SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, '')||x)::numeric
FROM generate_series(1,100))
FROM generate_series(1,100000) g(x);
SELECT sum(x^(-2)) FROM num_test;
Time: 112.949 ms (HEAD)
Time: 98.537 ms (with patch)
Patch 0002 simplifies the inner loop of div_var() (the guts of the
public-facing numeric division operator) by more closely combining the
multiplication and subtraction operations and folding the separate
"carry" and "borrow" variables into a single "borrow", as suggested by
the old code comment.
IMO, this makes the code easier to understand, as well as giving more
significant performance gains:
CREATE TEMP TABLE div_test(x numeric);
SELECT setseed(0);
INSERT INTO div_test
SELECT (SELECT ('1.'||string_agg((random()*9)::int::text, ''))::numeric + x
FROM generate_series(1,50))
FROM generate_series(1,1000) g(x);
SELECT sum(x/y) FROM div_test t1(x), div_test t2(y);
Time: 1477.340 ms (HEAD)
Time: 826.748 ms (with patch)
Patch 0003 replaces some uses of div_var_fast() with div_var().
Specifically, when the divisor has just one or two base-NBASE digits,
div_var() is faster. This is especially true for 1-digit divisors, due
to the "fast path" code in div_var() to handle that. It's also just
about true for 2-digit divisors, and it occurs to me that that case
could potentially be optimised further with similar fast path code in
div_var(), but I haven't tried that yet.
One-digit divisors are quite common. For example, in the Taylor series
expansions in exp_var() and ln_var(), since the number of Taylor
series terms never exceeds a few hundred in practice. Also,
exp_var()'s argument reduction step divides by 2^ndiv2, which is
roughly 100 times the input, rounded up to a power of two, and valid
inputs are less than 6000, so this will always be one or two digits.
Testing this with a bunch of random exp() and ln() computations I saw
a speed-up of 15-20%, and it reduced the run time of the numeric-big
regression test by around 10%, which seems worth having.
Regards,
Dean
Attachments:
0001-Apply-auto-vectorization-to-the-inner-loop-of-div_va.patchtext/x-patch; charset=US-ASCII; name=0001-Apply-auto-vectorization-to-the-inner-loop-of-div_va.patchDownload+10-2
0003-Use-div_var-instead-of-div_var_fast-when-the-divisor.patchtext/x-patch; charset=US-ASCII; name=0003-Use-div_var-instead-of-div_var_fast-when-the-divisor.patchDownload+14-5
0002-Simplify-the-inner-loop-of-numeric-division-in-div_v.patchtext/x-patch; charset=US-ASCII; name=0002-Simplify-the-inner-loop-of-numeric-division-in-div_v.patchDownload+15-22
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
Attached are 3 small patches that improve the performance of numeric
division. Patch 0002 seems to have the biggest impact, but I think
they're all worth including, since they're quite simple changes, with
noticeable performance gains.
I took a quick look through these (just eyeball, didn't try to verify
your performance statements). I'm +1 on 0001 and 0002, but 0003 feels
a bit ad-hoc. It certainly *looks* weird for the allegedly faster
function to be handing off to the allegedly slower one. I also wonder
if we're leaving anything on the table by not exploiting
div_var_fast's weaker roundoff guarantees in this case. Should we
think about a more thoroughgoing redesign of these functions' APIs?
Another idea is to only worry about the single-divisor-digit
optimization, and just copy div_var's (very small) inner loop for that
case into div_var_fast.
regards, tom lane
On Wed, 23 Feb 2022 at 20:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I took a quick look through these (just eyeball, didn't try to verify
your performance statements).
Thanks for looking!
I'm +1 on 0001 and 0002, but 0003 feels
a bit ad-hoc. It certainly *looks* weird for the allegedly faster
function to be handing off to the allegedly slower one. I also wonder
if we're leaving anything on the table by not exploiting
div_var_fast's weaker roundoff guarantees in this case. Should we
think about a more thoroughgoing redesign of these functions' APIs?
Hmm, I'm not sure what kind of thing you had in mind.
One thought that occurred to me was that it's a bit silly that
exp_var() and ln_var() have to use a NumericVar for what could just be
an int, if we had a div_var_int() function that could divide by an
int. Then both div_var() and div_var_fast() could hand off to it for
one and two digit divisors.
Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
On Wed, 23 Feb 2022 at 20:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm +1 on 0001 and 0002, but 0003 feels
a bit ad-hoc. It certainly *looks* weird for the allegedly faster
function to be handing off to the allegedly slower one. I also wonder
if we're leaving anything on the table by not exploiting
div_var_fast's weaker roundoff guarantees in this case. Should we
think about a more thoroughgoing redesign of these functions' APIs?
Hmm, I'm not sure what kind of thing you had in mind.
I'm not either, tbh. Just seems like this needs more than some
hacking around the margins.
One thought that occurred to me was that it's a bit silly that
exp_var() and ln_var() have to use a NumericVar for what could just be
an int, if we had a div_var_int() function that could divide by an
int. Then both div_var() and div_var_fast() could hand off to it for
one and two digit divisors.
Oooh, that seems like a good idea.
regards, tom lane
On Wed, 23 Feb 2022 at 22:55, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
One thought that occurred to me was that it's a bit silly that
exp_var() and ln_var() have to use a NumericVar for what could just be
an int, if we had a div_var_int() function that could divide by an
int. Then both div_var() and div_var_fast() could hand off to it for
one and two digit divisors.Oooh, that seems like a good idea.
OK, I've replaced the 0003 patch with one that does that instead. The
div_var_int() API is slightly different in that it also accepts a
divisor weight argument, but the alternative would have been for the
callers to have to adjust both the result weight and rscale, which
would have been uglier.
There's a large block of code in div_var() that needs re-indenting,
but I think it would be better to leave that to a later pgindent run.
The performance results are quite pleasing. It's slightly faster than
the old one-digit div_var() code because it manages to avoid some
digit array copying, and for two digit divisors it's much faster:
CREATE TEMP TABLE div_test(x numeric, y numeric);
SELECT setseed(0);
INSERT INTO div_test
SELECT (SELECT (((x%9)+1)||string_agg((random()*9)::int::text, ''))::numeric
FROM generate_series(1,50)),
(SELECT ('1.'||string_agg((random()*9)::int::text,
'')||(x%10)||'e3')::numeric
FROM generate_series(1,6))
FROM generate_series(1,5000) g(x);
select * from div_test limit 10;
SELECT sum(t1.x/t2.y) FROM div_test t1, div_test t2;
Time: 11600.034 ms (HEAD)
Time: 9890.788 ms (with 0002)
Time: 6202.851 ms (with 0003)
And obviously it'll be a larger relative gain for div_var_fast(),
since that was slower to begin with in such cases.
This makes me think that it might also be worthwhile to follow this
with a similar div_var_int64() function on platforms with 128-bit
integers, which could then be used to handle 3- and 4-digit divisors,
which are probably quite common in practice.
Attached is the updated patch series (0001 and 0002 unchanged).
Regards,
Dean
Attachments:
0002-Simplify-the-inner-loop-of-numeric-division-in-div_v.patchtext/x-patch; charset=US-ASCII; name=0002-Simplify-the-inner-loop-of-numeric-division-in-div_v.patchDownload+15-22
0001-Apply-auto-vectorization-to-the-inner-loop-of-div_va.patchtext/x-patch; charset=US-ASCII; name=0001-Apply-auto-vectorization-to-the-inner-loop-of-div_va.patchDownload+10-2
0003-Optimise-numeric-division-for-one-and-two-base-NBASE.patchtext/x-patch; charset=US-ASCII; name=0003-Optimise-numeric-division-for-one-and-two-base-NBASE.patchDownload+180-44
On Fri, 25 Feb 2022 at 10:45, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
Attached is the updated patch series (0001 and 0002 unchanged).
And another update following feedback from the cfbot.
Regards,
Dean
Attachments:
0001-Apply-auto-vectorization-to-the-inner-loop-of-div_va.patchtext/x-patch; charset=US-ASCII; name=0001-Apply-auto-vectorization-to-the-inner-loop-of-div_va.patchDownload+10-2
0002-Simplify-the-inner-loop-of-numeric-division-in-div_v.patchtext/x-patch; charset=US-ASCII; name=0002-Simplify-the-inner-loop-of-numeric-division-in-div_v.patchDownload+15-22
0003-Optimise-numeric-division-for-one-and-two-base-NBASE.patchtext/x-patch; charset=US-ASCII; name=0003-Optimise-numeric-division-for-one-and-two-base-NBASE.patchDownload+180-44
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
And another update following feedback from the cfbot.
This patchset LGTM. On my machine there doesn't seem to be any
measurable performance change for the numeric regression test,
but numeric_big gets about 15% faster.
The only additional thought I have, based on your comments about
0001, is that maybe in mul_var we should declare var1digit as
being NumericDigit not int. I tried that here and didn't see
any material change in the generated assembly code (using gcc
8.5.0), so you're right that modern gcc doesn't need any help
there; but I wonder if it could help on other compiler versions.
I'll mark this RFC.
regards, tom lane
On Fri, 25 Feb 2022 at 18:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
And another update following feedback from the cfbot.
This patchset LGTM. On my machine there doesn't seem to be any
measurable performance change for the numeric regression test,
but numeric_big gets about 15% faster.
Yes, that matches my observations. Thanks for reviewing and testing.
The only additional thought I have, based on your comments about
0001, is that maybe in mul_var we should declare var1digit as
being NumericDigit not int. I tried that here and didn't see
any material change in the generated assembly code (using gcc
8.5.0), so you're right that modern gcc doesn't need any help
there; but I wonder if it could help on other compiler versions.
Yes, that makes sense. Done that way.
Regards,
Dean