Shave a few cycles off our ilog10 implementation

Started by David Fetterabout 1 year ago6 messages

david@fetter.org

about 1 year ago

1 attachment(s)

Please find attached a patch to $Subject

I've done some preliminary testing, and it appears to shave somewhere
between 25-50% off the operations themselves, and these cascade into
things like formatting result sets and COPY OUT.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Attachments:

v1-0001-Shave-a-few-instructions-off-ilog10.patchtext/x-diff; charset=us-asciiDownload

From 612c848d8025d318e5b6e0da5bb04d0f7ed22126 Mon Sep 17 00:00:00 2001
From: David Fetter <david@fetter.org>
Date: Mon, 28 Oct 2024 04:38:16 -0700
Subject: [PATCH v1] Shave a few instructions off ilog10

by using SWAR and a lookup table.
---
 src/backend/utils/adt/numutils.c | 73 ++++++++++++++++++++++++++------
 1 file changed, 60 insertions(+), 13 deletions(-)

diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 63c2beb6a29..a53b647c054 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -39,30 +39,76 @@ static const char DIGIT_TABLE[200] =
 "90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
 
 /*
- * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ * Adapted from https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/
  */
 static inline int
 decimalLength32(const uint32 v)
 {
-	int			t;
-	static const uint32 PowersOfTen[] = {
-		1, 10, 100,
-		1000, 10000, 100000,
-		1000000, 10000000, 100000000,
-		1000000000
+	/* Each entry in the table is a 64-bit unsigned integer.
+	 * - The upper 32 bits of this integer is floor(ilog10(v)), a thing we can know from lg(v).
+	 * - The lower 32 bits of the integer is a number n such that v + n = 2^32 when v is a power of 10.
+	 *
+	 * These two parts taken together create a way to put the number of decimal
+	 * digits as the upper 32 bits.
+	 *
+	 */
+	static const uint64_t BaseTwoToBaseTenPivot[] = {
+		4294967296,  8589934582,  8589934582,  8589934582,  12884901788,
+		12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
+		21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
+		25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
+		34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
+		38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
+		42949672960, 42949672960
 	};
 
-	/*
-	 * Compute base-10 logarithm by dividing the base-2 logarithm by a
-	 * good-enough approximation of the base-2 logarithm of 10
-	 */
-	t = (pg_leftmost_one_pos32(v) + 1) * 1233 / 4096;
-	return t + (v >= PowersOfTen[t]);
+	return (v + BaseTwoToBaseTenPivot[pg_leftmost_one_pos32(v)]) >> 32;
 }
 
+/*
+ * Adapted from https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/
+ */
 static inline int
 decimalLength64(const uint64 v)
 {
+#ifdef HAS_INT128
+	/* Each entry in the table is a 128-bit unsigned integer.
+	 * - The upper 64 bits of this integer is floor(ilog10(v)), a thing we can know from lg(v).
+	 * - The lower 64 bits of the integer is a number n such that v + n = 2^64 when v is a power of 10.
+	 *
+	 * These two parts taken together create a way to put the number of decimal
+	 * digits as the upper 64 bits.
+	 *
+	 */
+	static const uint128 BaseTwoToBaseTenPivot[] = {
+		 18446744073709551616,
+		 18446744073709551606,  18446744073709551606,  18446744073709551606,
+		 36893488147419103132,  36893488147419103132,  36893488147419103132,
+		 55340232221128653848,  55340232221128653848,  55340232221128653848,
+		 73786976294838196464,  73786976294838196464,  73786976294838196464,
+		 92233720368547658080,  92233720368547658080,  92233720368547658080,
+		110680464442256309696, 110680464442256309696, 110680464442256309696,
+		129127208515956861312, 129127208515956861312, 129127208515956861312,
+		147573952589576412928, 147573952589576412928, 147573952589576412928,
+		166020696662385964544, 166020696662385964544, 166020696662385964544,
+		184467440727095516160, 184467440727095516160, 184467440727095516160,
+		202914184710805067776, 202914184710805067776, 202914184710805067776,
+		221360927884514619392, 221360927884514619392, 221360927884514619392,
+		239807662958224171008, 239807662958224171008, 239807662958224171008,
+		258254317031933722624, 258254317031933722624, 258254317031933722624,
+		276700161105643274240, 276700161105643274240, 276700161105643274240,
+		295137905179352825856, 295137905179352825856, 295137905179352825856,
+		313494649253062377472, 313494649253062377472, 313494649253062377472,
+		331041393326771929088, 331041393326771929088, 331041393326771929088,
+		340488137400481480704, 340488137400481480704, 340488137400481480704,
+		268934881474191032320, 268934881474191032320, 268934881474191032320
+	};
+
+	return (v + BaseTwoToBaseTenPivot[pg_leftmost_one_pos64(v)]) >> 64;
+#else												/* !HAVE_INT128 */
+	/*
+	 * * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+	 */
 	int			t;
 	static const uint64 PowersOfTen[] = {
 		UINT64CONST(1), UINT64CONST(10),
@@ -83,6 +129,7 @@ decimalLength64(const uint64 v)
 	 */
 	t = (pg_leftmost_one_pos64(v) + 1) * 1233 / 4096;
 	return t + (v >= PowersOfTen[t]);
+#endif												/* HAVE_INT128 */
 }
 
 static const int8 hexlookup[128] = {
-- 
2.47.0

Heikki Linnakangas

hlinnaka@iki.fi

about 1 year ago

In reply to: David Fetter (#1)

Re: Shave a few cycles off our ilog10 implementation

On 30/10/2024 21:27, David Fetter wrote:

Please find attached a patch to $Subject

I've done some preliminary testing, and it appears to shave somewhere
between 25-50% off the operations themselves, and these cascade into
things like formatting result sets and COPY OUT.

Impressive! What did you use to performance test it, to get those results?

--
Heikki Linnakangas
Neon (https://neon.tech)

David Fetter

david@fetter.org

about 1 year ago

In reply to: Heikki Linnakangas (#2)

Re: Shave a few cycles off our ilog10 implementation

On Wed, Oct 30, 2024 at 09:54:20PM +0200, Heikki Linnakangas wrote:

On 30/10/2024 21:27, David Fetter wrote:

Please find attached a patch to $Subject

I've done some preliminary testing, and it appears to shave somewhere
between 25-50% off the operations themselves, and these cascade into
things like formatting result sets and COPY OUT.

Impressive! What did you use to performance test it, to get those results?

In case that wasn't clear, what I've tested so far was the ilog10
implementations, not the general effects on the things they underlie.

This testing was basically just sending a bunch of appropriately sized
pseudo-random uints in a previously created array sent through a tight
loop that called the ilog10s and getting average execution times.

Any suggestions for more thorough testing would be welcome.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

David Rowley

dgrowleyml@gmail.com

about 1 year ago

In reply to: David Fetter (#3)

Re: Shave a few cycles off our ilog10 implementation

On Thu, 31 Oct 2024 at 09:02, David Fetter <david@fetter.org> wrote:

This testing was basically just sending a bunch of appropriately sized
pseudo-random uints in a previously created array sent through a tight
loop that called the ilog10s and getting average execution times.

Any suggestions for more thorough testing would be welcome.

Maybe something similar to what I did in [1]/messages/by-id/CAApHDvopR=yPgNr4AbbN4HMOztuyVa+iFYRTvu49pxg9YO_tKw@mail.gmail.com.

David

[1]: /messages/by-id/CAApHDvopR=yPgNr4AbbN4HMOztuyVa+iFYRTvu49pxg9YO_tKw@mail.gmail.com

John Naylor

johncnaylorls@gmail.com

about 1 year ago

In reply to: David Rowley (#4)

Re: Shave a few cycles off our ilog10 implementation

On Thu, Oct 31, 2024 at 3:14 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 31 Oct 2024 at 09:02, David Fetter <david@fetter.org> wrote:

This testing was basically just sending a bunch of appropriately sized
pseudo-random uints in a previously created array sent through a tight
loop that called the ilog10s and getting average execution times.

Any suggestions for more thorough testing would be welcome.

Maybe something similar to what I did in [1].

David

[1] /messages/by-id/CAApHDvopR=yPgNr4AbbN4HMOztuyVa+iFYRTvu49pxg9YO_tKw@mail.gmail.com

I tried this test as-is and saw no difference. I bumped it up to 10
columns and got (no turbo, about 30 transactions):

master:
6831.308 ms
patch:
6580.506 ms

The difference is small enough that normally I'd say it's likely
unrelated to the patch, but on the other hand it's consistent with
saving (3 * 10 * 10 million) cycles because of 1 less multiplication
each, which is not nothing, but for shoving bytes into /dev/null it's
not exciting either. The lookup for the 64-bit case has grown to 1024
bytes, which will compete for cache space. I don't have a strong
reason to be either for or against this patch. Anyone else want to
test?

create table bi (a bigint, b bigint, c bigint, d bigint, e bigint, f
bigint, g bigint, h bigint, i bigint, j bigint);
insert into bi select i,i,i,i,i,i,i,i,i,i from generate_Series(1,10_000_000) i;
vacuum freeze analyze bi;

pgbench -n -T 180 -f bench.sql
--
John Naylor
Amazon Web Services

David Rowley

dgrowleyml@gmail.com

about 1 year ago

In reply to: John Naylor (#5)

1 attachment(s)

Re: Shave a few cycles off our ilog10 implementation

On Wed, 18 Dec 2024 at 23:42, John Naylor <johncnaylorls@gmail.com> wrote:

The difference is small enough that normally I'd say it's likely
unrelated to the patch, but on the other hand it's consistent with
saving (3 * 10 * 10 million) cycles because of 1 less multiplication
each, which is not nothing, but for shoving bytes into /dev/null it's
not exciting either. The lookup for the 64-bit case has grown to 1024
bytes, which will compete for cache space. I don't have a strong
reason to be either for or against this patch. Anyone else want to
test?

I tried it out too on my Zen4 machine. I don't doubt David saw a
speedup when testing the performance in isolation, but I can't detect
anything going faster when using it in Postgres.

Maybe we can revisit if we make COPY TO faster someday. As of today,
it's a pretty inefficient lump of code.

My results:

$ echo master && ./intbench.sh
master
NOTICE: relation "tmp" already exists, skipping
CREATE TABLE AS
latency average = 246.294 ms
latency average = 243.167 ms
latency average = 245.620 ms
latency average = 247.135 ms
latency average = 248.206 ms
latency average = 253.433 ms
latency average = 259.296 ms
latency average = 248.856 ms
latency average = 247.518 ms
latency average = 259.581 ms
latency average = 244.426 ms
latency average = 244.553 ms
latency average = 249.909 ms
latency average = 244.079 ms
latency average = 246.422 ms
latency average = 248.763 ms
latency average = 247.318 ms
latency average = 249.675 ms
latency average = 245.192 ms
latency average = 253.975 ms

$ echo patched && ./intbench.sh
patched
NOTICE: relation "tmp" already exists, skipping
CREATE TABLE AS
latency average = 253.964 ms
latency average = 257.463 ms
latency average = 250.506 ms
latency average = 252.401 ms
latency average = 260.806 ms
latency average = 250.120 ms
latency average = 251.539 ms
latency average = 262.180 ms
latency average = 252.349 ms
latency average = 251.332 ms
latency average = 249.490 ms
latency average = 252.696 ms
latency average = 251.895 ms
latency average = 248.466 ms
latency average = 255.839 ms
latency average = 253.334 ms
latency average = 250.548 ms
latency average = 288.164 ms
latency average = 252.587 ms
latency average = 256.059 ms

perf top:

master:
16.59% postgres [.] CopyAttributeOutText
15.63% libc.so.6 [.] __memmove_avx512_unaligned_erms
12.94% postgres [.] pg_ltoa
9.85% postgres [.] CopyOneRowTo
6.86% postgres [.] AllocSetAlloc
6.73% postgres [.] tts_buffer_heap_getsomeattrs

patched
19.53% libc.so.6 [.] __memmove_avx512_unaligned_erms
12.52% postgres [.] pg_ltoa
11.76% postgres [.] CopyAttributeOutText
11.40% postgres [.] CopyOneRowTo
6.96% postgres [.] tts_buffer_heap_getsomeattrs
6.35% postgres [.] AllocSetAlloc

I can't think of what we have that exercises pg_ltoa() or pg_ultoa_n()
more. timestamp_out() might, but that's lots of small ints.

David