Small and unlikely overflow hazard in bms_next_member()
I've been working on bms_left_shift_members() to bitshift members
either left or right in order to tidy up some existing code and
improve a future Bitmapset use case I'm currently working on.
When testing some ERROR code I added to ensure we don't get an
excessively large left shift value and end up with members higher than
INT32_MAX, I discovered that bms_next_member() can't handle that
value, as "prevbit++" will wrap to INT32_MIN and then we'll try to
access a negative array index, i.e. seg fault.
I appreciate that such a large member is quite unlikely, but if this
isn't fixed then I need to write my error checking to disallow
members >= INT32_MAX rather than > INT32_MAX. I did have a comment
explaining why I was doing that, but fixing the bug saves the weird
special case and the comment.
Patch attached. I was thinking it might not be worthy of
backpatching, but I'll entertain alternative views on that.
David
Attachments:
bms_next_member_fix.patch (application/octet-stream, +5 −4)
David Rowley <dgrowleyml@gmail.com> writes:
When testing some ERROR code I added to ensure we don't get an
excessively large left shift value and end up with members higher than
INT32_MAX, I discovered that bms_next_member() can't handle that
value, as "prevbit++" will wrap to INT32_MIN and then we'll try to
access a negative array index, i.e. seg fault.
I appreciate that such a large member is quite unlikely,
I think it's impossible, and if it's not then this is not the
only place in bitmapset.c that could theoretically overflow.
As an example, bms_prev_member does
Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD);
but if the bitmapset were large enough to accommodate INT_MAX
as a member then a->nwords * BITS_PER_BITMAPWORD must overflow.
I don't think we should add cycles here for this purpose.
If it makes you feel better, maybe add Asserts to
bms_make_singleton and bms_add_member to constrain the
maximum member value to somewhat less than INT_MAX?
regards, tom lane
On Apr 2, 2026, at 12:09, David Rowley <dgrowleyml@gmail.com> wrote:
Both the return value of bms_next_member() and the parameter "prevbit" are defined as int, which seems to imply that a bitmap can hold at most INT32_MAX elements. So I wonder if a cleaner fix would be to just add a range guard, like this:
```
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 786f343b3c9..7f79f81f278 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1294,7 +1294,7 @@ bms_next_member(const Bitmapset *a, int prevbit)
Assert(bms_is_valid_set(a));
- if (a == NULL)
+ if (a == NULL || prevbit == INT32_MAX)
return -2;
nwords = a->nwords;
prevbit++;
```
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Chao Li <li.evan.chao@gmail.com> wrote on Thu, Apr 2, 2026 at 14:39:
Hi,
Both solutions look good to me. I'm slightly keen on Chao's version,
which looks simpler to me.
Thanks!
wang jie
On Thu, 2 Apr 2026 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD);
but if the bitmapset were large enough to accommodate INT_MAX
as a member then a->nwords * BITS_PER_BITMAPWORD must overflow.
I missed that one. That's annoying. Even "prevbit = a->nwords *
BITS_PER_BITMAPWORD - 1;" is undefined if it wraps due to the signed
maths.
I don't think we should add cycles here for this purpose.
I'm not keen on slowing things down for this either. I did do some
experiments in [1] that show fewer instructions from using 64-bit
maths. I might go off and see if there are any wins there that also
give us the INT_MAX fix. The effort-to-reward ratio isn't great,
though...

David

[1] https://godbolt.org/z/Eh1vzssq7
On Thu, 2 Apr 2026 at 19:39, Chao Li <li.evan.chao@gmail.com> wrote:
Both the return value of bms_next_member() and the parameter "prevbit" are defined as int, which seems to imply that a bitmap can hold at most INT32_MAX elements. So I wonder if a cleaner fix would be to just add a range guard, like this:

- if (a == NULL)
+ if (a == NULL || prevbit == INT32_MAX)
I don't think that's a nice way at all. It adds a special case plus
extra instructions, including a new jump instruction for the boolean
short-circuit: more instruction decoding, more L1i space needed, more
branches to predict, and more space taken in the branch predictor
tables that could evict other useful branches. It adds run-time
overhead.
I was aiming for low overhead and no special cases. 2a600a93c means we
don't care about the performance on 32-bit hardware anymore, so that
can't be used as a counterargument.
David
On Fri, 3 Apr 2026 at 11:12, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 2 Apr 2026 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I don't think we should add cycles here for this purpose.
I'm not keen on slowing things down for this either. I did do some
experiments in [1] that sees fewer instructions from using 64-bit
maths. I might go off and see if there are any wins there that also
give us the INT_MAX fix. It's not great effort to reward ratio
though...
The reduction in instructions with the patched version got me curious
to see if it would translate into a performance increase. I tested on
an AMD Zen2 machine, and it's a decent amount faster than master. I
tested with gcc and clang.
I also scanned over the remaining parts of bitmapset.c and didn't find
anywhere else that has overflow risk aside from what you pointed out
in bms_prev_member().
The attached patch contains the benchmark function I added to the
test_bitmapset module. It should apply to master with a bit of noise.
CREATE EXTENSION test_bitmapset;
SELECT
    generate_series(1,3) AS run,
    bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS bms_next_member_us,
    bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS bms_prev_member_us;
master (gcc)
run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
1 | 26473 | 40404
2 | 26218 | 40413
3 | 26209 | 40387
patched (gcc)
run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
1 | 25409 | 29705
2 | 24905 | 29693
3 | 24870 | 29707
Times are in microseconds to do 1 million bms_*_member() loops over
the entire set.
I've also attached the full results I got. I've also included the
results from Chao's version, which does slow things down decently on
clang.
IMO, if we can make bitmapset.c work with INT_MAX members and get a
performance increase, then we should do it.
David
On Apr 3, 2026, at 10:08, David Rowley <dgrowleyml@gmail.com> wrote:
I also did a load test with a standalone C program with 4 versions:
* The original bms_next_member (Original)
* The fast version from [1], which uses 64-bit maths (Fast)
* The original version + INT32_MAX check + 64-bit maths (Original2)
* The other approach, which pulls the first iteration up out of the loop so that "mask = (~(bitmapword) 0);" is removed from the loop (PullUp)
Note: all tests used -O2 to build the executable.
On my MacBook M4, the Fast version constantly won, and PullUp version performed badly.
```
% gcc --version
Apple clang version 17.0.0 (clang-1700.6.4.2)
Target: arm64-apple-darwin25.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
```
A typical test run:
```
Benchmarking 100000 iterations...
Original: 0.48893 seconds
Fast: 0.46979 seconds
Original2: 0.47740 seconds
PullUp: 0.48029 seconds
```
On my Windows laptop (Intel(R) Core Ultra 5, with WSL-based Ubuntu), Original2 won in most runs, and the PullUp version was faster than the Fast version.
```
chaol@lichao-highgo:~$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
A typical test run:
```
Original: 0.99849 seconds
Fast: 0.74722 seconds
Original2: 0.59407 seconds
PullUp: 0.62746 seconds
```
Then I also tried to run on Windows directly. Here, PullUp version performed the best.
```
$ gcc --version
gcc.exe (Rev13, Built by MSYS2 project) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```
A typical test run:
```
Original: 0.32931 seconds
Fast: 0.32740 seconds
Original2: 0.32378 seconds
PullUp: 0.30795 seconds
```
I'm curious: when something performs differently across platforms, which platform should take priority?
Please see the attached test program. It’s possible I did something wrong.
[1]: https://godbolt.org/z/Eh1vzssq7
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, 3 Apr 2026 at 16:24, Chao Li <li.evan.chao@gmail.com> wrote:
I also did a load test with a standalone c program with 4 versions:
I think you'd be better off picking a more realistic Bitmapset. The
code is doing:
int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc *
sizeof(bitmapword));
bms->nwords = words_to_alloc;
memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
/* Set a bit far into the set to force a long scan */
int target_bit = (words_to_alloc - 1) * 64 + 10;
bms->words[words_to_alloc - 1] |= (1ULL << 10);
A set with 20k words! Then you're doing:
for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);
So you're not looping correctly over the set, as looping over a set
with 1 member requires 2 calls to the function.
I'd say this unrealistically favours the unrolled loop version, as
most of the effort is in the loop looping over empty words and you can
do that without the extra bit-wise AND that the other versions have.
That version also seems to apply both the 64-bit fix and the INT_MAX
fix. Why both? Also, because you're only calling the function once per
iteration, your || INT_MAX check is only executed once, rather than
n_members + 1 times as it normally would be. That'll make your
version seem better than it is.
I fixed all that and changed the Bitmapset to something more realistic
with how it's more likely to be seen in PostgreSQL and ran the
benchmark with a proper bms_next_member loop. File attached and
results below:
Zen2 Linux
$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...
Original: 0.62113 seconds
David's: 0.70257 seconds
Chao's version: 1.70691 seconds
Unrolled loop: 1.56239 seconds
2800000000
$ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...
Original: 1.11263 seconds
David's: 0.87399 seconds
Chao's version: 0.52962 seconds
Unrolled loop: 1.07799 seconds
2800000000
Apple M2:
Benchmarking 100000000 iterations...
Original: 0.64023 seconds
David's: 0.41087 seconds
Chao's version: 1.21113 seconds
Unrolled loop: 0.55924 seconds
2800000000
I’m curious that, when something performs differently across platforms, which platform should take priority?
I don't think it matters too much for this case. If we were actually
working on a function that was a bottleneck, then it would be worth
looking deeper and seeing if the compilers are producing suboptimal
code and then try to coax them into making better code. As a last
resort, have two versions and some preprocessor to decide which one.
However, this just isn't performance-critical enough to warrant such
an effort. I think we're already going far too far with this. I just
want to fix the bug. If performance increases in some way as a result
of that, then good.
If I sum up the times from each version of the above results, I get:
Original: 2.37399
David's: 1.98743
Chao's: 3.44766
Unrolled: 3.19962
I'm happy to go with the fastest one. I expect the one with the fewest
instructions is likely more important when running it in the planner,
as that's less to decode. I didn't look, but I expect your loop
unrolled version and your || INT_MAX version to have more
instructions.
David
Attachments:
test_bms_next2.c (text/plain, charset=US-ASCII)
On Apr 3, 2026, at 19:42, David Rowley <dgrowleyml@gmail.com> wrote:
In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...
Original: 0.61559 seconds
David's: 0.61651 seconds
Chao's version: 1.06033 seconds
Unrolled loop: 0.60561 seconds
2800000000
```
(I also checked on Godbolt: from the instruction-count perspective, the unrolled version actually looks the worst.)
I guess one-word Bitmapsets are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it, as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic.
It was fun working on this patch, and thank you for your patience.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sat, 4 Apr 2026 at 02:28, Chao Li <li.evan.chao@gmail.com> wrote:
In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...
Original: 0.61559 seconds
David's: 0.61651 seconds
Chao's version: 1.06033 seconds
Unrolled loop: 0.60561 seconds
2800000000
```
Doing the same locally, here's what I get on my Zen2 machine:
drowley@amd3990x:~$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...
Original: 1.21125 seconds
David's: 1.11997 seconds
Chao's version: 1.72662 seconds
Unrolled loop: 1.63090 seconds
2800000000
drowley@amd3990x:~$ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...
Original: 1.10780 seconds
David's: 1.05968 seconds
Chao's version: 1.87123 seconds
Unrolled loop: 1.11002 seconds
2800000000
I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic.
Yeah, but it's better to find bottlenecks and focus there than to
optimise blindly without knowing if it'll make any actual difference.
It's still important to test for regressions when modifying code and
try to avoid them, as it's sometimes not obvious if the code being
modified is critical for performance.
I also don't think we should be doing anything to optimise for
multi-word sets at the expense of slowing down single-word sets. Not
that it's very representative of the real world, but I added some
telemetry to bitmapset.c and I see 43 multi-word sets being created
and 1.45 million single-word sets created during make check, or 1 in
33785, if you prefer.
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 786f343b3c9..21b5053e9a2 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -219,6 +219,10 @@ bms_make_singleton(int x)
int wordnum,
bitnum;
+ if (x >= 64)
+ elog(NOTICE, "multi-word set bms_make_singleton");
+ else
+ elog(NOTICE, "single-word set bms_make_singleton");
if (x < 0)
elog(ERROR, "negative bitmapset member not allowed");
wordnum = WORDNUM(x);
@@ -811,6 +815,9 @@ bms_add_member(Bitmapset *a, int x)
wordnum = WORDNUM(x);
bitnum = BITNUM(x);
+ if (x >= 64 && a->nwords == 1)
+ elog(NOTICE, "multi-set bms_add_member");
+
/* enlarge the set if necessary */
if (wordnum >= a->nwords)
{
Not quite perfect as a set made first as a single word set that later
becomes a multi-word set will be double counted. The number of
operations on the sets is likely more important anyway, not the number
of sets being created. The point is, multi-word sets are rare for most
workloads.
David
On Apr 4, 2026, at 11:30, David Rowley <dgrowleyml@gmail.com> wrote:
I just tried my Windows laptop with both MSYS2 and WSL Ubuntu, and on both of them the original version and David’s version consistently performed better than the other versions.
I have a new finding from Godbolt: adding likely(w != 0) seems to reduce the instruction count by one. bms_next_member_fast has 26 instructions, while after adding "likely", bms_next_member_fast2 has 25. It may avoid one jump and perhaps be slightly better, but I'm not sure; I'm really not an expert in assembly.
Then I tested. "likely" doesn't seem to help on Mac:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...
Original: 0.54994 seconds
David's: 0.78218 seconds
David's likely: 0.78990 seconds
Chao's version: 1.11530 seconds
Unrolled loop: 0.47660 seconds
3500000000
```
But it helps slightly on Windows:
```
$ ./test_bms_next2
Benchmarking 100000000 iterations...
Original: 1.45312 seconds
David's: 1.48438 seconds
David's likely: 1.40625 seconds
Chao's version: 3.12500 seconds
Unrolled loop: 2.95312 seconds
-794967296
```
I'm not suggesting adding likely() to your version; I'm just sharing the information for your reference and leaving the decision to you.
I also don't think we should be doing anything to optimise for
multi-word sets at the expense of slowing down single-word sets.
Fully agreed.
Not that it's very representative of the real world, but I added some
telemetry to bitmapset.c and I see 43 multi-word sets being created
and 1.45 million single-word sets created during make check, or 1 in
33785, if you prefer.
What tests did you run after adding the logs to collect the data? This is a method I might borrow in the future for similar investigations.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/