Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

Started by Ranier Vilelaalmost 2 years ago7 messages
#1Ranier Vilela
ranier.vf@gmail.com
1 attachment(s)

Hi,

IMO I believe that bitmapset can obtain an optimization in the calculation
of the WORDNUM and BITNUM macros.

As you know, in bitmapset, negative members are not allowed.

if (x < 0)
elog(ERROR, "negative bitmapset member not allowed");

Then, allow the compiler to optimize and do the calculations in unsigned.

I did a demonstration in compiler explorer.
https://godbolt.org/z/c68o43Eq1

It is demonstrated here as well.

Generated instructions: GCC 13.2

#define WORDNUM_f1(x) ((x) / BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
lea edx, [rax+63]
test eax, eax
cmovs eax, edx
sar eax, 6
mov DWORD PTR [rbp-4], eax

#define WORDNUM_f2(x) ((bitmapword) (x) / BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
cdqe
shr rax, 6
mov DWORD PTR [rbp-4], eax

#define BITNUM_f1(x) ((x) % BITS_PER_BITMAPWORD)
mov edx, DWORD PTR [rbp-20]
mov eax, edx
sar eax, 31
shr eax, 26
add edx, eax
and edx, 63
sub edx, eax
mov DWORD PTR [rbp-4], edx

#define BITNUM_f2(x) ((bitmapword) (x) % BITS_PER_BITMAPWORD)
mov eax, DWORD PTR [rbp-20]
and eax, 63
mov DWORD PTR [rbp-4], eax

As you can see, when calculations are done in unsigned,
the generated instructions are optimized.

Patch attached.

Best regards,
Ranier Vilela

Attachments:

optimize_bitmapset_bitmapword_calc.patchapplication/x-patch; name=optimize_bitmapset_bitmapword_calc.patchDownload
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..0765dd42c2 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -44,8 +44,8 @@
 #include "port/pg_bitutils.h"
 
 
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
+#define WORDNUM(x)	((bitmapword)(x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)	((bitmapword)(x) % BITS_PER_BITMAPWORD)
 
 #define BITMAPSET_SIZE(nwords)	\
 	(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
#2Nathan Bossart
nathandbossart@gmail.com
In reply to: Ranier Vilela (#1)
Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

On Mon, Jan 29, 2024 at 01:30:47PM -0300, Ranier Vilela wrote:

IMO I believe that bitmapset can obtain an optimization in the calculation
of the WORDNUM and BITNUM macros.

As you know, in bitmapset, negative members are not allowed.

if (x < 0)
elog(ERROR, "negative bitmapset member not allowed");

Then, allow the compiler to optimize and do the calculations in unsigned.

I'm currently +0.1 for this change. I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable impact.
However, a cycle saved is a cycle earned...

-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
+#define WORDNUM(x)	((bitmapword)(x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)	((bitmapword)(x) % BITS_PER_BITMAPWORD)

I'm curious why we'd cast to bitmapword and not straight to uint32. I
don't think the intent is that callers will provide a bitmapword to these
macros. I also wonder if it's worth asserting that x is >= 0 before
casting here.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#3Ranier Vilela
ranier.vf@gmail.com
In reply to: Nathan Bossart (#2)
Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

Em seg., 29 de jan. de 2024 às 16:32, Nathan Bossart <
nathandbossart@gmail.com> escreveu:

On Mon, Jan 29, 2024 at 01:30:47PM -0300, Ranier Vilela wrote:

IMO I believe that bitmapset can obtain an optimization in the

calculation

of the WORDNUM and BITNUM macros.

As you know, in bitmapset, negative members are not allowed.

if (x < 0)
elog(ERROR, "negative bitmapset member not allowed");

Then, allow the compiler to optimize and do the calculations in unsigned.

I'm currently +0.1 for this change. I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable impact.
However, a cycle saved is a cycle earned...

Bitmapset is extensively used.

-#define WORDNUM(x)     ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)      ((x) % BITS_PER_BITMAPWORD)
+#define WORDNUM(x)     ((bitmapword)(x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)      ((bitmapword)(x) % BITS_PER_BITMAPWORD)

I'm curious why we'd cast to bitmapword and not straight to uint32. I
don't think the intent is that callers will provide a bitmapword to these
macros.

bitmapword It is the most correct and prudent option, if in the future,
we decide to change the number of nwords to uint64.

I also wonder if it's worth asserting that x is >= 0 before
casting here.

I don't think this would change anything.

best regards,
Ranier Vilela

#4Nathan Bossart
nathandbossart@gmail.com
In reply to: Ranier Vilela (#3)
Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

On Mon, Jan 29, 2024 at 04:43:32PM -0300, Ranier Vilela wrote:

Em seg., 29 de jan. de 2024 �s 16:32, Nathan Bossart <
nathandbossart@gmail.com> escreveu:

-#define WORDNUM(x)     ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)      ((x) % BITS_PER_BITMAPWORD)
+#define WORDNUM(x)     ((bitmapword)(x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)      ((bitmapword)(x) % BITS_PER_BITMAPWORD)

I'm curious why we'd cast to bitmapword and not straight to uint32. I
don't think the intent is that callers will provide a bitmapword to these
macros.

bitmapword It is the most correct and prudent option, if in the future,
we decide to change the number of nwords to uint64.

If we change nwords to a uint64, I think there will be many other changes
required. Using uint32 probably trims the instructions further on machines
with 64-bit pointers, too (e.g., cdqe).

I also wonder if it's worth asserting that x is >= 0 before
casting here.

I don't think this would change anything.

Right, but it would offer another layer of protection against negative
integers in Bitmapsets.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#5David Rowley
dgrowleyml@gmail.com
In reply to: Nathan Bossart (#2)
Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote:

I'm currently +0.1 for this change. I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable impact.
However, a cycle saved is a cycle earned...

FWIW, In [1]/messages/by-id/CAApHDvq9eq0W_aFUGrb6ba28ieuQN4zM5Uwqxy7+LMZjJc+VGg@mail.gmail.com and subsequent replies, there are several examples of
benchmarks where various bitmapset functions are sitting high in the
profiles. So I wouldn't be too surprised if such a small change to the
WORDNUM and BITNUM macros made a noticeable difference.

A benchmark speaks a thousand words, however.

David

[1]: /messages/by-id/CAApHDvq9eq0W_aFUGrb6ba28ieuQN4zM5Uwqxy7+LMZjJc+VGg@mail.gmail.com

#6Nathan Bossart
nathandbossart@gmail.com
In reply to: David Rowley (#5)
Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

On Tue, Jan 30, 2024 at 11:23:57AM +1300, David Rowley wrote:

On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com> wrote:

I'm currently +0.1 for this change. I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable impact.
However, a cycle saved is a cycle earned...

FWIW, In [1] and subsequent replies, there are several examples of
benchmarks where various bitmapset functions are sitting high in the
profiles. So I wouldn't be too surprised if such a small change to the
WORDNUM and BITNUM macros made a noticeable difference.

Good to know, thanks. If there is indeed demonstrable improvement, I'd
readily adjust my +0.1 to +1.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#7Ranier Vilela
ranier.vf@gmail.com
In reply to: Nathan Bossart (#6)
Re: Optmize bitmapword macros calc (src/backend/nodes/bitmapset.c)

Em seg., 29 de jan. de 2024 às 19:40, Nathan Bossart <
nathandbossart@gmail.com> escreveu:

On Tue, Jan 30, 2024 at 11:23:57AM +1300, David Rowley wrote:

On Tue, 30 Jan 2024 at 08:32, Nathan Bossart <nathandbossart@gmail.com>

wrote:

I'm currently +0.1 for this change. I don't see any huge problem with
trimming a few instructions, but I'm dubious there's any measurable

impact.

However, a cycle saved is a cycle earned...

FWIW, In [1] and subsequent replies, there are several examples of
benchmarks where various bitmapset functions are sitting high in the
profiles. So I wouldn't be too surprised if such a small change to the
WORDNUM and BITNUM macros made a noticeable difference.

Good to know, thanks. If there is indeed demonstrable improvement, I'd
readily adjust my +0.1 to +1.

Following the suggestions, I did a quick test with one of the scripts.

Ubuntu 64 bits
gcc 12.3 64 bits

create table t1 (a int) partition by list(a);
select 'create table t1_'||x||' partition of t1 for values
in('||x||');' from generate_series(0,9)x;

test1.sql
select * from t1 where a > 1 and a < 3;

pgbench -U postgres -n -f test1.sql -T 15 postgres

head:

tps = 27983.182940
tps = 28916.903038
tps = 29051.878855

patched:

tps = 27517.301246
tps = 27848.684133
tps = 28669.367300

create table t2 (a int) partition by list(a);
select 'create table t2_'||x||' partition of t2 for values
in('||x||');' from generate_series(0,9999)x;

test2.sql
select * from t2 where a > 1 and a < 3;

pgbench -U postgres -n -f test2.sql -T 15 postgres

head:

tps = 27144.044463
tps = 28932.948620
tps = 29299.016248

patched:

tps = 27363.364039
tps = 28588.141586
tps = 28669.367300

To my complete surprise, the change is slower.
I can't say how, with fewer instructions, gcc makes the binary worse.

best regards,
Ranier Vilela