[PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

Started by Joel Jacobsonabout 3 years ago8 messageshackers
Jump to latest
#1Joel Jacobson
joel@compiler.org

Hi,

On platforms where we support 128bit integers, we could accelerate division
when the number of digits in the divisor is larger than 8 and less than or
equal to 16 digits, i.e. when the divisor that fits in a 64-bit integer but would
not fit in a 32-bit integer.

This patch adds div_var_int64(), which is similar to the existing div_var_int(),
but accepts a 64-bit divisor instead of a 32-bit divisor.

The new function is used within div_var() and div_var_fast().

To measure the effect, we need a volatile wrapper function for numeric_div(),
to avoid it being cached since it's immutable:

CREATE OR REPLACE FUNCTION numeric_div_volatile(numeric,numeric)
RETURNS numeric LANGUAGE internal AS 'numeric_div';

We can then use generate_series() to measure the execution time for lots of
executions. This does not account for the overhead of generate_series() and
count(), but that's okay since the overhead is the same in both measurements,
so the relative difference is still correct.

--
-- Division when the divisor is 8 digits should be unchanged:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
repeat('1',131071)::numeric,
repeat('3',8)::numeric
)) FROM generate_series(1,1e4);
-- Execution Time: 1633.722 ms (HEAD)
-- Execution Time: 1680.228 ms (div_var_int64.patch)

--
-- Division when the divisor is 9 digits should be faster:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
repeat('1',131071)::numeric,
repeat('3',9)::numeric
)) FROM generate_series(1,1e4);
-- Execution Time: 5444.755 ms (HEAD)
-- Execution Time: 1604.967 ms (div_var_int64.patch)

--
-- Division when the divisor is 16 digits should also be faster:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
repeat('1',131071)::numeric,
repeat('3',16)::numeric
)) FROM generate_series(1,1e4);
-- Execution Time: 6072.683 ms (HEAD)
-- Execution Time: 3215.686 ms (div_var_int64.patch)

--
-- Division when the divisor is 17 digits should be unchanged:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
repeat('1',131071)::numeric,
repeat('3',17)::numeric
)) FROM generate_series(1,1e4);
-- Execution Time: 6948.150 ms (HEAD)
-- Execution Time: 7010.544 ms (div_var_int64.patch)

--
-- Same tests as above, but with a single digit dividend,
-- and 1e7 executions instead of just 1e4.
--

--
-- Division when the divisor is 8 digits should be unchanged:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
1,
repeat('3',8)::numeric
)) FROM generate_series(1,1e7);
-- Execution Time: 1827.567 ms (HEAD)
-- Execution Time: 1828.029 ms (div_var_int64.patch)

--
-- Division when the divisor is 9 digits should be faster:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
1,
repeat('3',9)::numeric
)) FROM generate_series(1,1e7);
-- Execution Time: 2314.851 ms (HEAD)
-- Execution Time: 1886.170 ms (div_var_int64.patch)

--
-- Division when the divisor is 16 digits should also be faster:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
1,
repeat('3',16)::numeric
)) FROM generate_series(1,1e7);
-- Execution Time: 2244.009 ms (HEAD)
-- Execution Time: 1968.148 ms (div_var_int64.patch)

--
-- Division when the divisor is 17 digits should be unchanged:
--
EXPLAIN ANALYZE
SELECT count(numeric_div_volatile(
1,
repeat('3',17)::numeric
)) FROM generate_series(1,1e7);
-- Execution Time: 2334.896 ms (HEAD)
-- Execution Time: 2338.141 ms (div_var_int64.patch)

The graph below shows the effect on execution time for numeric_div(),
and also looks at numeric_mod() since it's a heavy user of numeric_div().

The graph was produced by generating 100 random numeric integer values
for each combination of number of dividend/divisor digits between 1 to 20.

In total, that's 20*20*100*2=80000 test values.

As expected, the ceiling for the fast short division is lifted from 8 to 16 divisor digits,
and speedups for modulus is noticed in the same region.

The graph was produced using results from pg-timeit [1]https://github.com/joelonsql/pg-timeit and R for the plotting.

[1]: https://github.com/joelonsql/pg-timeit

/Joel

Attachments:

image.pngimage/png; name=image.pngDownload+4-0
div_var_int64.patchapplication/octet-stream; name=div_var_int64.patchDownload+168-0
#2Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Joel Jacobson (#1)
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

On Sun, 22 Jan 2023 at 13:42, Joel Jacobson <joel@compiler.org> wrote:

Hi,

On platforms where we support 128bit integers, we could accelerate division
when the number of digits in the divisor is larger than 8 and less than or
equal to 16 digits, i.e. when the divisor that fits in a 64-bit integer but would
not fit in a 32-bit integer.

Seems like a reasonable idea, with some pretty decent gains.

Note, however, that for a divisor having fewer than 5 or 6 digits,
it's now significantly slower because it's forced to go through
div_var_int64() instead of div_var_int() for all small divisors. So
the var2ndigits <= 2 case needs to come first.

The implementation of div_var_int64() should be in an #ifdef HAVE_INT128 block.

In div_var_int64(), s/ULONG_MAX/PG_UINT64_MAX/

Regards,
Dean

#3Joel Jacobson
joel@compiler.org
In reply to: Dean Rasheed (#2)
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

On Sun, Jan 22, 2023, at 11:06, Dean Rasheed wrote:

Seems like a reasonable idea, with some pretty decent gains.

Note, however, that for a divisor having fewer than 5 or 6 digits,
it's now significantly slower because it's forced to go through
div_var_int64() instead of div_var_int() for all small divisors. So
the var2ndigits <= 2 case needs to come first.

Can you give a measurable example of when the patch
the way it's written is significantly slower for a divisor having
fewer than 5 or 6 digits, on some platform?

I can't detect any difference at all at my MacBook Pro M1 Max for this example:
EXPLAIN ANALYZE SELECT count(numeric_div_volatile(1,3333)) FROM generate_series(1,1e8);

I did write the code like you suggest first, but changed it,
since I realised the extra "else if" needed could be eliminated,
and thought div_var_int64() wouldn't be slower than div_var_int() since
I thought 64-bit instructions in general are as fast as 32-bit instructions,
on 64-bit platforms.

I'm not suggesting your claim is incorrect, I'm just trying to understand
and verify it experimentally.

The implementation of div_var_int64() should be in an #ifdef HAVE_INT128 block.

In div_var_int64(), s/ULONG_MAX/PG_UINT64_MAX/

OK, thanks, I'll fix, but I'll await your feedback first on the above.

/Joel

#4Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Joel Jacobson (#3)
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

On Sun, 22 Jan 2023 at 15:41, Joel Jacobson <joel@compiler.org> wrote:

On Sun, Jan 22, 2023, at 11:06, Dean Rasheed wrote:

Seems like a reasonable idea, with some pretty decent gains.

Note, however, that for a divisor having fewer than 5 or 6 digits,
it's now significantly slower because it's forced to go through
div_var_int64() instead of div_var_int() for all small divisors. So
the var2ndigits <= 2 case needs to come first.

Can you give a measurable example of when the patch
the way it's written is significantly slower for a divisor having
fewer than 5 or 6 digits, on some platform?

I just modified the previous test you posted:

\timing on
SELECT count(numeric_div_volatile(1e131071,123456)) FROM generate_series(1,1e4);

Time: 2048.060 ms (00:02.048) -- HEAD
Time: 2422.720 ms (00:02.423) -- With patch

I did write the code like you suggest first, but changed it,
since I realised the extra "else if" needed could be eliminated,
and thought div_var_int64() wouldn't be slower than div_var_int() since
I thought 64-bit instructions in general are as fast as 32-bit instructions,
on 64-bit platforms.

Apparently it can make a difference. Probably something to do with
having less data to move around. I remember noticing that when I wrote
div_var_int(), which is why I split it into 2 branches in that way.

Regards,
Dean

#5Joel Jacobson
joel@compiler.org
In reply to: Dean Rasheed (#4)
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

On Sun, Jan 22, 2023, at 14:25, Dean Rasheed wrote:

I just modified the previous test you posted:

\timing on
SELECT count(numeric_div_volatile(1e131071,123456)) FROM generate_series(1,1e4);

Time: 2048.060 ms (00:02.048) -- HEAD
Time: 2422.720 ms (00:02.423) -- With patch

...

Apparently it can make a difference. Probably something to do with
having less data to move around. I remember noticing that when I wrote
div_var_int(), which is why I split it into 2 branches in that way.

Many thanks for feedback. Nice catch! New patch attached.

Interesting, I'm not able to reproduce this on my MacBook Pro M1 Max:

SELECT version;
PostgreSQL 16devel on aarch64-apple-darwin22.2.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.202), 64-bit

SELECT count(numeric_div_volatile(1e131071,123456)) FROM generate_series(1,1e 4);
Time: 1569.730 ms (00:01.570) - HEAD
Time: 1569.918 ms (00:01.570) -- div_var_int64.patch
Time: 1569.038 ms (00:01.569) -- div_var_int64-2.patch

Just curious, what platform are you on?

/Joel

Attachments:

div_var_int64-2.patchapplication/octet-stream; name=div_var_int64-2.patchDownload+167-1
#6John Naylor
john.naylor@enterprisedb.com
In reply to: Joel Jacobson (#3)
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

On Sun, Jan 22, 2023 at 10:42 PM Joel Jacobson <joel@compiler.org> wrote:

I did write the code like you suggest first, but changed it,
since I realised the extra "else if" needed could be eliminated,
and thought div_var_int64() wouldn't be slower than div_var_int() since
I thought 64-bit instructions in general are as fast as 32-bit

instructions,

on 64-bit platforms.

According to Agner's instruction tables [1]https://www.agner.org/optimize/instruction_tables.pdf, integer division on Skylake
(for example) has a latency of 26 cycles for 32-bit operands, and 42-95
cycles for 64-bit.

[1]: https://www.agner.org/optimize/instruction_tables.pdf

--
John Naylor
EDB: http://www.enterprisedb.com

#7Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: John Naylor (#6)
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

On Mon, 23 Jan 2023 at 05:06, John Naylor <john.naylor@enterprisedb.com> wrote:

According to Agner's instruction tables [1], integer division on Skylake (for example) has a latency of 26 cycles for 32-bit operands, and 42-95 cycles for 64-bit.

[1] https://www.agner.org/optimize/instruction_tables.pdf

Thanks, that's a very useful reference.

(And I do indeed have one of those CPUs, which explains what I was seeing.)

Regards,
Dean

#8Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Joel Jacobson (#5)
Re: [PATCH] Use 128-bit math to accelerate numeric division, when 8 < divisor digits <= 16

On Sun, 22 Jan 2023 at 22:49, Joel Jacobson <joel@compiler.org> wrote:

Many thanks for feedback. Nice catch! New patch attached.

Cool, that resolves the performance issues I was seeing for smaller
divisors (which also had a noticeable impact on the numeric_big
regression test).

After some more testing, the gains look good to me, and I wasn't able
to find any cases where it made things slower, so I've gone ahead and
pushed it.

Regards,
Dean