Slim down integer formatting
Folks,
Please find attached a patch to do $subject. It's down to one table
lookup and 3 instructions.
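To see how the trick works on a single entry, here is a tiny standalone
example (not part of the patch itself; the value is the one that ends up
in digit_transitions[9] in the attached diff):

#include <assert.h>
#include <stdint.h>

int
main(void)
{
	/*
	 * Worked example for one entry, using the value from the attached
	 * patch: index 9 covers v in 512..1023 and holds 17179868184, which
	 * is (4 << 32) - 1000.  The add folds the "did we reach the next
	 * power of ten?" comparison into the upper 32 bits, and the shift
	 * extracts the decimal length.
	 */
	uint64_t	entry = (UINT64_C(4) << 32) - 1000; /* digit_transitions[9] */

	assert(((UINT64_C(999) + entry) >> 32) == 3);	/* 999 has 3 digits */
	assert(((UINT64_C(1000) + entry) >> 32) == 4);	/* 1000 has 4 digits */
	return 0;
}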
In covering the int64 versions, I swiped a lightweight division from
the Ryu stuff. I'm pretty sure that what I did is not how to do
#includes, but it's a PoC. What would be a better way to do this?
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachments:
v1-0001-PoC-Slim-down-integer-printing-some-more.patch (text/x-diff)
From 1cf7202facd9fee161865d90304e5ede1e3c65cf Mon Sep 17 00:00:00 2001
From: David Fetter <david@fetter.org>
Date: Mon, 26 Jul 2021 16:43:43 -0700
Subject: [PATCH v1] PoC: Slim down integer printing some more
---
src/backend/utils/adt/numutils.c | 62 +++++++++++++++-----------------
1 file changed, 29 insertions(+), 33 deletions(-)
diff --git src/backend/utils/adt/numutils.c src/backend/utils/adt/numutils.c
index b93096f288..c4f2d5cfb1 100644
--- src/backend/utils/adt/numutils.c
+++ src/backend/utils/adt/numutils.c
@@ -18,6 +18,7 @@
#include <limits.h>
#include <ctype.h>
+#include "../common/d2s_intrinsics.h"
#include "common/int.h"
#include "utils/builtins.h"
#include "port/pg_bitutils.h"
@@ -39,50 +40,45 @@ static const char DIGIT_TABLE[200] =
"90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
/*
- * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ * Adapted from
+ * https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/
*/
static inline int
decimalLength32(const uint32 v)
{
- int t;
- static const uint32 PowersOfTen[] = {
- 1, 10, 100,
- 1000, 10000, 100000,
- 1000000, 10000000, 100000000,
- 1000000000
+ /*
+ * Each element in the array, when added to a number with MSB that is the
+ * array index, will produce a number whose top 32 bits are its decimal
+ * length, so that can now be had using:
+ *
+ * 1 CLZ instruction,
+ * 1 table lookup,
+ * 1 add, and
+ * 1 shift
+ *
+ */
+ static const uint64 digit_transitions[32] = {
+ 4294967295, 8589934582, 8589934582, 8589934582, 12884901788,
+ 12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
+ 21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
+ 25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
+ 34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
+ 38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
+ 42949672960, 42949672960
};
- /*
- * Compute base-10 logarithm by dividing the base-2 logarithm by a
- * good-enough approximation of the base-2 logarithm of 10
- */
- t = (pg_leftmost_one_pos32(v) + 1) * 1233 / 4096;
- return t + (v >= PowersOfTen[t]);
+ return (v + digit_transitions[pg_leftmost_one_pos32(v)]) >> 32;
}
static inline int
decimalLength64(const uint64 v)
{
- int t;
- static const uint64 PowersOfTen[] = {
- UINT64CONST(1), UINT64CONST(10),
- UINT64CONST(100), UINT64CONST(1000),
- UINT64CONST(10000), UINT64CONST(100000),
- UINT64CONST(1000000), UINT64CONST(10000000),
- UINT64CONST(100000000), UINT64CONST(1000000000),
- UINT64CONST(10000000000), UINT64CONST(100000000000),
- UINT64CONST(1000000000000), UINT64CONST(10000000000000),
- UINT64CONST(100000000000000), UINT64CONST(1000000000000000),
- UINT64CONST(10000000000000000), UINT64CONST(100000000000000000),
- UINT64CONST(1000000000000000000), UINT64CONST(10000000000000000000)
- };
-
- /*
- * Compute base-10 logarithm by dividing the base-2 logarithm by a
- * good-enough approximation of the base-2 logarithm of 10
- */
- t = (pg_leftmost_one_pos64(v) + 1) * 1233 / 4096;
- return t + (v >= PowersOfTen[t]);
+ if (v >> 32 == 0)
+ return decimalLength32(v);
+ else if (div1e8(v) >> 32 == 0)
+ return 8 + decimalLength32(div1e8(v));
+ else
+ return 16 + decimalLength32(div1e8(div1e8(v)));
}
/*
--
2.31.1
On Tue, Jul 27, 2021 at 9:51 AM David Fetter <david@fetter.org> wrote:
In covering the int64 versions, I swiped a lightweight division from
the Ryu stuff. I'm pretty sure that what I did is not how to do
#includes, but it's a PoC. What would be a better way to do this?
That patch didn't apply for me (on latest source) so I've attached an
equivalent with those changes, that does apply, and also tweaks the
Makefile include path to address that #include issue.
Regards,
Greg Nancarrow
Fujitsu Australia
Attachments:
v2-0001-PoC-Slim-down-integer-printing-some-more.patch (application/octet-stream)
From d0c41fcad900f1cf593e22cd0e20aa4396c731eb Mon Sep 17 00:00:00 2001
From: Greg Nancarrow <gregn4422@gmail.com>
Date: Tue, 27 Jul 2021 12:03:21 +1000
Subject: [PATCH v2] PoC: Slim down integer printing some more
Author: David Fetter
---
src/backend/utils/adt/Makefile | 2 +-
src/backend/utils/adt/numutils.c | 62 +++++++++++++++-----------------
2 files changed, 30 insertions(+), 34 deletions(-)
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 41b486bcef..287186e3fa 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -8,7 +8,7 @@ subdir = src/backend/utils/adt
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
+override CPPFLAGS := -I. -I$(srcdir) -I$(top_builddir)/src/common $(CPPFLAGS)
# keep this list arranged alphabetically or it gets to be a mess
OBJS = \
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index b93096f288..03edd572b9 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -19,6 +19,7 @@
#include <ctype.h>
#include "common/int.h"
+#include "d2s_intrinsics.h"
#include "utils/builtins.h"
#include "port/pg_bitutils.h"
@@ -39,50 +40,45 @@ static const char DIGIT_TABLE[200] =
"90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
/*
- * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ * Adapted from
+ * https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/
*/
static inline int
decimalLength32(const uint32 v)
{
- int t;
- static const uint32 PowersOfTen[] = {
- 1, 10, 100,
- 1000, 10000, 100000,
- 1000000, 10000000, 100000000,
- 1000000000
- };
-
/*
- * Compute base-10 logarithm by dividing the base-2 logarithm by a
- * good-enough approximation of the base-2 logarithm of 10
+ * Each element in the array, when added to a number with MSB that is the
+ * array index, will produce a number whose top 32 bits are its decimal
+ * length, so that can now be had using:
+ *
+ * 1 CLZ instruction,
+ * 1 table lookup,
+ * 1 add, and
+ * 1 shift
+ *
*/
- t = (pg_leftmost_one_pos32(v) + 1) * 1233 / 4096;
- return t + (v >= PowersOfTen[t]);
+ static const uint64 digit_transitions[32] = {
+ 4294967295, 8589934582, 8589934582, 8589934582, 12884901788,
+ 12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
+ 21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
+ 25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
+ 34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
+ 38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
+ 42949672960, 42949672960
+ };
+
+ return (v + digit_transitions[pg_leftmost_one_pos32(v)]) >> 32;
}
static inline int
decimalLength64(const uint64 v)
{
- int t;
- static const uint64 PowersOfTen[] = {
- UINT64CONST(1), UINT64CONST(10),
- UINT64CONST(100), UINT64CONST(1000),
- UINT64CONST(10000), UINT64CONST(100000),
- UINT64CONST(1000000), UINT64CONST(10000000),
- UINT64CONST(100000000), UINT64CONST(1000000000),
- UINT64CONST(10000000000), UINT64CONST(100000000000),
- UINT64CONST(1000000000000), UINT64CONST(10000000000000),
- UINT64CONST(100000000000000), UINT64CONST(1000000000000000),
- UINT64CONST(10000000000000000), UINT64CONST(100000000000000000),
- UINT64CONST(1000000000000000000), UINT64CONST(10000000000000000000)
- };
-
- /*
- * Compute base-10 logarithm by dividing the base-2 logarithm by a
- * good-enough approximation of the base-2 logarithm of 10
- */
- t = (pg_leftmost_one_pos64(v) + 1) * 1233 / 4096;
- return t + (v >= PowersOfTen[t]);
+ if (v >> 32 == 0)
+ return decimalLength32(v);
+ else if (div1e8(v) >> 32 == 0)
+ return 8 + decimalLength32(div1e8(v));
+ else
+ return 16 + decimalLength32(div1e8(div1e8(v)));
}
/*
--
2.27.0
On Tue, Jul 27, 2021 at 12:28:22PM +1000, Greg Nancarrow wrote:
That patch didn't apply for me (on latest source) so I've attached an
equivalent with those changes, that does apply, and also tweaks the
Makefile include path to address that #include issue.
When applying some micro-benchmarking to stress those APIs, how much
does this change things? At the end of the day, this also comes down
to an evaluation of pg_ulltoa_n() and pg_ultoa_n().
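Even a rough standalone comparison of the two decimalLength32()
codings would give a first idea. A sketch along these lines, with the
caveats that __builtin_clz only stands in for pg_leftmost_one_pos32,
both variants are copied from the diff, and uniform random input is
not necessarily representative of real data:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static const uint32_t PowersOfTen[] = {
	1, 10, 100, 1000, 10000, 100000,
	1000000, 10000000, 100000000, 1000000000
};

static const uint64_t digit_transitions[32] = {
	4294967295, 8589934582, 8589934582, 8589934582, 12884901788,
	12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
	21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
	25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
	34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
	38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
	42949672960, 42949672960
};

/* current coding: approximate log10 from log2, then correct by compare */
static int
old_length32(uint32_t v)
{
	int			t = (32 - __builtin_clz(v)) * 1233 / 4096;

	return t + (v >= PowersOfTen[t]);
}

/* proposed coding: CLZ, table lookup, add, shift */
static int
new_length32(uint32_t v)
{
	return (int) ((v + digit_transitions[31 - __builtin_clz(v)]) >> 32);
}

int
main(void)
{
	enum { N = 10 * 1000 * 1000 };
	uint32_t   *vals = malloc(N * sizeof(uint32_t));
	uint64_t	sum = 0;
	clock_t		t0, t1, t2;

	/* fill with pseudo-random nonzero values */
	srand(42);
	for (int i = 0; i < N; i++)
		vals[i] = (((uint32_t) rand() << 16) ^ (uint32_t) rand()) | 1;

	t0 = clock();
	for (int i = 0; i < N; i++)
		sum += old_length32(vals[i]);
	t1 = clock();
	for (int i = 0; i < N; i++)
		sum += new_length32(vals[i]);
	t2 = clock();

	/* the checksum keeps the loops from being optimized away */
	printf("old: %.3fs  new: %.3fs  (checksum %llu)\n",
		   (double) (t1 - t0) / CLOCKS_PER_SEC,
		   (double) (t2 - t1) / CLOCKS_PER_SEC,
		   (unsigned long long) sum);
	free(vals);
	return 0;
}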
#include "common/int.h"
+#include "d2s_intrinsics.h"
Er, are you sure about this part? The first version of the patch did
that in a different, also incorrect, way.
--
Michael
On Tue, 27 Jul 2021 at 14:42, Michael Paquier <michael@paquier.xyz> wrote:
When applying some micro-benchmarking to stress those APIs, how much
does this change things? At the end of the day, this also comes down
to an evaluation of pg_ulltoa_n() and pg_ultoa_n().
I'd suggest something like creating a table with, say 1 million INTs
and testing the performance of copy <table> to '/dev/null';
Repeat for BIGINT
David
On Tue, Jul 27, 2021 at 12:42 PM Michael Paquier <michael@paquier.xyz> wrote:
#include "common/int.h"
+#include "d2s_intrinsics.h"
Er, are you sure about this part? The first version of the patch did
that in a different, also incorrect, way.
Er, I was just trying to help out, so at least the patch could be
applied (whether the patch has merit is a different story).
Are you saying that it's incorrect to include that header file in this
source, or that it's the wrong way to do it? (i.e. is it wrong to adjust
the Makefile include path to pick up the location where that header is
located and use #include "<header>"? That header is in src/common,
which is not on the default include path.)
The method I used certainly works, but you have objections?
Can you clarify what you say is incorrect?
Regards,
Greg Nancarrow
Fujitsu Australia
On Tue, 27 Jul 2021 at 15:08, Greg Nancarrow <gregn4422@gmail.com> wrote:
On Tue, Jul 27, 2021 at 12:42 PM Michael Paquier <michael@paquier.xyz> wrote:
#include "common/int.h"
+#include "d2s_intrinsics.h"
Er, are you sure about this part? The first version of the patch did
that in a different, also incorrect, way.
Er, I was just trying to help out, so at least the patch could be
applied (whether the patch has merit is a different story).
Are you saying that it's incorrect to include that header file in this
source, or that it's the wrong way to do it? (i.e. is it wrong to adjust
the Makefile include path to pick up the location where that header is
located and use #include "<header>"? That header is in src/common,
which is not on the default include path.)
The method I used certainly works, but you have objections?
Can you clarify what you say is incorrect?
I think the mistake is that the header file is not in
src/include/common. For some reason, it's ended up with all the .c
files in src/common.
I imagine Andrew did this because he didn't ever expect anything else
to have a use for these. He indicates that in [1].
Maybe Andrew can confirm?
[1]: https://www.postgresql.org/message-id/87mup9192t.fsf@news-spur.riddles.org.uk
David
"David" == David Rowley <dgrowleyml@gmail.com> writes:
David> I think the mistake is that the header file is not in
David> src/include/common. For some reason, it's ended up with all the
David> .c files in src/common.
David> I imagine Andrew did this because he didn't ever expect anything
David> else to have a use for these. He indicates that in [1].
David> Maybe Andrew can confirm?
It's not that anything else wouldn't have a use for those, it's that
anything else SHOULDN'T have a use for those because they are straight
imports from upstream Ryu code, they aren't guaranteed to work outside
the ranges of values required by Ryu, and if we decided to import a
newer copy of Ryu then it would be annoying if any other code was broken
as a result.
In short, please don't include d2s_intrinsics.h from anywhere other than
d2s.c
--
Andrew.
On Tue, Jul 27, 2021 at 3:08 PM Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
In short, please don't include d2s_intrinsics.h from anywhere other than
d2s.c
Thanks for the clarification.
The patch author will need to take this into account for the patch's
div1e8() usage.
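One option might be a small local helper in numutils.c, relying on the
compiler to turn the constant division into a multiply-and-shift. A
sketch only (div_1e8 is a made-up local name, and it assumes the
decimalLength32() from the patch is in scope):

/*
 * Hypothetical local stand-in for the Ryu div1e8(); modern compilers
 * turn a division by a constant into a multiply-and-shift on their own.
 */
static inline uint64
div_1e8(const uint64 v)
{
	return v / UINT64CONST(100000000);
}

static inline int
decimalLength64(const uint64 v)
{
	/* same reduction as in the patch, lopping off 8 digits at a time */
	if (v >> 32 == 0)
		return decimalLength32((uint32) v);
	else if (div_1e8(v) >> 32 == 0)
		return 8 + decimalLength32((uint32) div_1e8(v));
	else
		return 16 + decimalLength32((uint32) div_1e8(div_1e8(v)));
}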
Regards,
Greg Nancarrow
Fujitsu Australia
On 2021-Jul-26, David Fetter wrote:
Folks,
Please find attached a patch to do $subject. It's down to one table
lookup and 3 instructions.
So how much faster is it than the original?
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
On Wed, 28 Jul 2021 at 01:44, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
So how much faster is it than the original?
I only did some very quick tests. They're a bit noisy. The results
indicate an average speedup of 1.7%, but the noise level is above
that, so unsure.
create table a (a int);
insert into a select a from generate_series(1,1000000)a;
vacuum freeze a;
bench.sql: copy a to '/dev/null';
master @ 93a0bf239
drowley@amd3990x:~$ pgbench -n -f bench.sql -T 60 postgres
latency average = 153.815 ms
latency average = 152.955 ms
latency average = 147.491 ms
master + v2 patch
drowley@amd3990x:~$ pgbench -n -f bench.sql -T 60 postgres
latency average = 144.749 ms
latency average = 151.525 ms
latency average = 150.392 ms
David
On Wed, Jul 28, 2021 at 01:17:43PM +1200, David Rowley wrote:
On Wed, 28 Jul 2021 at 01:44, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
So how much faster is it than the original?
I only did some very quick tests. They're a bit noisy. The results
indicate an average speedup of 1.7%, but the noise level is above
that, so unsure.
create table a (a int);
insert into a select a from generate_series(1,1000000)a;
vacuum freeze a;
bench.sql: copy a to '/dev/null';
master @ 93a0bf239
drowley@amd3990x:~$ pgbench -n -f bench.sql -T 60 postgres
latency average = 153.815 ms
latency average = 152.955 ms
latency average = 147.491 ms
master + v2 patch
drowley@amd3990x:~$ pgbench -n -f bench.sql -T 60 postgres
latency average = 144.749 ms
latency average = 151.525 ms
latency average = 150.392 ms
Thanks for testing this! I got a few promising results early on with
-O0, and the technique seemed like a neat way to do things.
I generated a million int4s intended to be uniformly distributed
across the range of int4, and similarly across int8.
int4:
                   patch        6feebcb6b44631c3dc435e971bd80c2dd218a5ab
latency average:   362.149 ms   359.933 ms
latency stddev:    3.44 ms      3.40 ms
int8:
                   patch        6feebcb6b44631c3dc435e971bd80c2dd218a5ab
latency average:   434.944 ms   422.270 ms
latency stddev:    3.23 ms      4.02 ms
when compiled with -O2:
int4:
                   patch        6feebcb6b44631c3dc435e971bd80c2dd218a5ab
latency average:   167.262 ms   148.673 ms
latency stddev:    6.26 ms      1.28 ms
i.e. it was actually slower, at least over the 10 runs I did.
I assume that "uniform distribution across the range" is a bad-case
scenario for ints, but I was a little surprised to measure worse
performance. Interestingly, what I got for int8s generated to be
uniform across their range was
int8:
                   patch        6feebcb6b44631c3dc435e971bd80c2dd218a5ab
latency average:   171.737 ms   174.013 ms
latency stddev:    1.94 ms      6.84 ms
which doesn't look like a difference to me.
Intuitively, I'd expect us to get things in the neighborhood of 1 a
lot more often than things in the neighborhood of 1 << (30 or 60). Do
we have some idea of the distribution, or at least of the distribution
family, that we should expect for ints?
Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, 28 Jul 2021 at 14:25, David Fetter <david@fetter.org> wrote:
Intuitively, I'd expect us to get things in the neighborhood of 1 a
lot more often than things in the neighborhood of 1 << (30 or 60). Do
we have some idea of the distribution, or at least of the distribution
family, that we should expect for ints?
serial and bigserial are generally going to start with smaller
numbers. In larger and longer-lived databases those numbers could end
up on the larger side. serial and bigserial should be a fairly large
portion of the use case for integer types, so anything that slows down
int4out and int8out for lower-order numbers is not a good idea. I
think it would have to be a very small slowdown on the low-order
numbers vs a large speedup for higher-order numbers for us to even
consider it.
David