Making empty Bitmapsets always be NULL

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Nathan Bossart (#2)

Re: Making empty Bitmapsets always be NULL

Nathan Bossart <nathandbossart@gmail.com> writes:

On Tue, Feb 28, 2023 at 04:59:48PM -0500, Tom Lane wrote:

When I designed the Bitmapset module, I set things up so that an empty
Bitmapset could be represented either by a NULL pointer, or by an
allocated object all of whose bits are zero. I've recently come to
the conclusion that that was a bad idea and we should instead have
a convention like the longstanding invariant for Lists: an empty
list is represented by NIL and nothing else.

+1

Thanks for looking at this.

Unless there is a way to avoid the invariant violation that doesn't involve
scanning the rest of the words before bms_first_member returns, +1 to
removing it. Perhaps we could add a num_members variable to the struct so
that we know right away when the set becomes empty. But maintaining that
might be more trouble than it's worth.

bms_first_member is definitely legacy code, so let's just get
rid of it. Done like that in 0001 below. (This was slightly more
complex than I foresaw, because some of the callers were modifying
the result variables. But they're probably cleaner this way anyway.)

I also discovered that nodeAppend.c is relying on bms_del_members
not reducing a non-empty set to NULL, because it uses the nullness
of appendstate->as_valid_subplans as a state boolean.

The separate boolean certainly seems less fragile. That might even be
worthwhile independent of the rest of the patch.

Yeah. I split out those executor fixes as 0002; 0003 is the changes
to bitmapsets proper, and then 0004 removes now-dead code.

regards, tom lane

Nathan Bossart

nathandbossart@gmail.com

about 3 years ago

In reply to: Tom Lane (#4)

Re: Making empty Bitmapsets always be NULL

On Wed, Mar 01, 2023 at 05:59:45PM -0500, Tom Lane wrote:

bms_first_member is definitely legacy code, so let's just get
rid of it. Done like that in 0001 below. (This was slightly more
complex than I foresaw, because some of the callers were modifying
the result variables. But they're probably cleaner this way anyway.)

Yeah, it's nice that you don't have to subtract
FirstLowInvalidHeapAttributeNumber in so many places. I think future
changes might end up using attidx when they ought to use attrnum (and vice
versa), but you could just as easily forget to subtract
FirstLowInvalidHeapAttributeNumber today, so it's probably fine.

+     /* attidx is zero-based, attrnum is the normal attribute number */
+     int         attrnum = attidx + FirstLowInvalidHeapAttributeNumber;

nitpick: Shouldn't this be an AttrNumber?

Yeah. I split out those executor fixes as 0002; 0003 is the changes
to bitmapsets proper, and then 0004 removes now-dead code.

These all looked reasonable to me.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Nathan Bossart (#5)

Re: Making empty Bitmapsets always be NULL

Nathan Bossart <nathandbossart@gmail.com> writes:

On Wed, Mar 01, 2023 at 05:59:45PM -0500, Tom Lane wrote:

+     /* attidx is zero-based, attrnum is the normal attribute number */
+     int         attrnum = attidx + FirstLowInvalidHeapAttributeNumber;

nitpick: Shouldn't this be an AttrNumber?

I stuck with the existing type choices for those variables,
but I don't mind changing to AttrNumber here.

regards, tom lane

Richard Guo

guofenglinux@gmail.com

about 3 years ago

In reply to: Tom Lane (#4)

Re: Making empty Bitmapsets always be NULL

On Thu, Mar 2, 2023 at 6:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah. I split out those executor fixes as 0002; 0003 is the changes
to bitmapsets proper, and then 0004 removes now-dead code.

+1 to all these patches. Some minor comments from me.

*0003
It seems that the Bitmapset checked by bms_is_empty_internal cannot be
NULL from how it is computed by a function. So I wonder if we can
remove the check of 'a' being NULL in that function, or reduce it to an
Assert.

-   if (a == NULL)
-       return true;
+   Assert(a != NULL);

*0004
It seems that in create_lateral_join_info around line 689, the
bms_is_empty check of lateral_relids is not necessary, since we've
checked that lateral_relids cannot be NULL several lines earlier.

@@ -682,12 +682,6 @@ create_lateral_join_info(PlannerInfo *root)
if (lateral_relids == NULL)
continue;

- /*
- * We should not have broken the invariant that lateral_relids is
- * exactly NULL if empty.
- */
- Assert(!bms_is_empty(lateral_relids));
-

Thanks
Richard

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: Richard Guo (#7)

Re: Making empty Bitmapsets always be NULL

Richard Guo <guofenglinux@gmail.com> writes:

It seems that the Bitmapset checked by bms_is_empty_internal cannot be
NULL from how it is computed by a function. So I wonder if we can
remove the check of 'a' being NULL in that function, or reduce it to an
Assert.

Yeah, I think just removing it is sufficient. The subsequent attempts
to dereference the pointer will crash just fine if it's NULL; we don't
need an Assert to help things along.

It seems that in create_lateral_join_info around line 689, the
bms_is_empty check of lateral_relids is not necessary, since we've
checked that lateral_relids cannot be NULL several lines earlier.

Good catch, I missed that one.

Pushed, thanks for reviewing.

regards, tom lane

dgrowleyml@gmail.com

about 3 years ago

In reply to: Tom Lane (#1)

Re: Making empty Bitmapsets always be NULL

On Wed, 1 Mar 2023 at 10:59, Tom Lane <tgl@sss.pgh.pa.us> wrote:

When I designed the Bitmapset module, I set things up so that an empty
Bitmapset could be represented either by a NULL pointer, or by an
allocated object all of whose bits are zero. I've recently come to
the conclusion that that was a bad idea and we should instead have
a convention like the longstanding invariant for Lists: an empty
list is represented by NIL and nothing else.

I know I'm late to the party here, but I just wanted to add that I
agree with this and also that I've never been a fan of having to
decide if it was safe to check for NULL when I needed performance or
if I needed to use bms_is_empty() because the set might have all its
words set to 0.

I suggest tightening the rule even further so instead of just empty
sets having to be represented as NULL, the rule should be that sets
should never contain any trailing zero words, which is effectively a
superset of the "empty is NULL" rule that you've just changed.

If we did this, then various functions can shake loose some crufty
code and various other functions become more optimal due to not having
to loop over trailing zero words needlessly. For example.

* bms_equal() and bms_compare() can precheck nwords match before
troubling themselves with looping over each member, and;
* bms_is_subset() can return false early if 'a' has more words than 'b', and;
* bms_subset_compare() becomes more simple as it does not have to look
for trailing 0 words, and;
* bms_nonempty_difference() can return true early if 'a' has more
words than 'b', plus no need to check for trailing zero words at the
end.

We can also chop the set down to size in; bms_intersect(),
bms_difference(), bms_int_members(), bms_del_members() and
bms_intersect() which saves looping needlessly over empty words when
various other BMS operations are performed later on the set, for
example, bms_next_member(), bms_prev_member, bms_copy(), etc.

The only reason I can think of that this would be a bad idea is that
if we want to add members again then we need to do repalloc(). If
we're only increasing the nwords back to what it had been on some
previous occasion then repalloc() is effectively a no-op, so I doubt
this will really be a net negative. I think the effort we'll waste by
carrying around needless trailing zero words in most cases is likely
to outweigh the overhead of any no-op repallocs. Take
bms_int_members(), for example, we'll purposefully 0 out all the
trailing words possibly having to read in new cachelines from RAM to
do so. It would be better to leave having to read those in again
until we actually need to do something more useful with them, like
adding some new members to the set again. We'll have to dirty those
cachelines then anyway and we may have flushed those cachelines out of
the CPU cache by the time we get around to adding the new members
again.

I've coded this up in the attached and followed the lead in list.c and
added a function named check_bitmapset_invariants() which ensures the
above rule is followed. I think the code as it stands today should
really have something like that anyway.

The patch also optimizes sub-optimal newly added code which calls
bms_is_empty_internal() when we have other more optimal means to
determine if the set is empty or not.

David

#10

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: David Rowley (#9)

Re: Making empty Bitmapsets always be NULL

David Rowley <dgrowleyml@gmail.com> writes:

I suggest tightening the rule even further so instead of just empty
sets having to be represented as NULL, the rule should be that sets
should never contain any trailing zero words, which is effectively a
superset of the "empty is NULL" rule that you've just changed.

Hmm, I'm not immediately a fan of that, because it'd mean more
interaction with aset.c to change the allocated size of results.
(Is it worth carrying both "allocated words" and "nonzero words"
fields to avoid useless memory-management effort? Dunno.)

Another point here is that I'm pretty sure that just about all
bitmapsets we deal with are only one or two words, so I'm not
convinced you're going to get any performance win to justify
the added management overhead.

regards, tom lane

#11

dgrowleyml@gmail.com

about 3 years ago

In reply to: Tom Lane (#10)

Re: Making empty Bitmapsets always be NULL

On Fri, 3 Mar 2023 at 15:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(Is it worth carrying both "allocated words" and "nonzero words"
fields to avoid useless memory-management effort? Dunno.)

It would have been a more appealing thing to do before Bitmapset
became a node type as we'd have had empty space in the struct to have
another 32-bit word on 64-bit builds.

Another point here is that I'm pretty sure that just about all
bitmapsets we deal with are only one or two words, so I'm not
convinced you're going to get any performance win to justify
the added management overhead.

It's true that the majority of Bitmapsets are going to be just 1 word,
but it's important to acknowledge that we do suffer in some more
extreme cases when Bitmapsets become large. Partition with large
numbers of partitions is one such case.

create table lp(a int) partition by list(a);
select 'create table lp'||x||' partition of lp for values
in('||x||');' from generate_series(0,9999)x;
\gexec

# cat bench.sql
select * from lp where a > 1 and a < 3;

$ pgbench -n -T 15 -f bench.sql postgres | grep tps

master:
tps = 28055.619289 (without initial connection time)
tps = 27819.235083 (without initial connection time)
tps = 28486.099808 (without initial connection time)

master + bms_no_trailing_zero_words.patch:
tps = 30840.840266 (without initial connection time)
tps = 29491.519705 (without initial connection time)
tps = 29471.083938 (without initial connection time)

(~6.45% faster)

Of course, it's an extreme case, I'm merely trying to show that
trimming the Bitmapsets down can have an impact in some cases.

David

#12

tgl@sss.pgh.pa.us

about 3 years ago

In reply to: David Rowley (#11)

Re: Making empty Bitmapsets always be NULL

David Rowley <dgrowleyml@gmail.com> writes:

On Fri, 3 Mar 2023 at 15:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Another point here is that I'm pretty sure that just about all
bitmapsets we deal with are only one or two words, so I'm not
convinced you're going to get any performance win to justify
the added management overhead.

It's true that the majority of Bitmapsets are going to be just 1 word,
but it's important to acknowledge that we do suffer in some more
extreme cases when Bitmapsets become large. Partition with large
numbers of partitions is one such case.

Maybe, but optimizing for that while pessimizing every other case
doesn't sound very attractive from here. I think we need some
benchmarks on normal-size bitmapsets before considering doing much
in this area.

Also, if we're going to make any sort of changes here it'd probably
behoove us to make struct Bitmapset private in bitmapset.c, so that
we can have confidence that nobody is playing games with them.
I had a go at that and was pleasantly surprised to find that
actually nobody has; the attached passes check-world. It'd probably
be smart to commit this as a follow-on to 00b41463c, whether or not
we go any further.

Also, given that we do this, I don't think that check_bitmapset_invariants
as you propose it is worth the trouble. The reason we've gone to such
lengths with checking List invariants is that initially we had a large
number of places doing creative and not-too-structured things with Lists,
plus we've made several absolutely fundamental changes to that data
structure. Despite the far larger bug surface, I don't recall that those
invariant checks ever found anything after the initial rounds of changes.
So I don't buy that there's an argument for a similarly expensive set
of checks here. bitmapset.c is small enough that we should be able to
pretty much prove it correct by eyeball.

regards, tom lane

#13

dgrowleyml@gmail.com

about 3 years ago

In reply to: Tom Lane (#12)

Re: Making empty Bitmapsets always be NULL

On Sat, 4 Mar 2023 at 11:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David Rowley <dgrowleyml@gmail.com> writes:

On Fri, 3 Mar 2023 at 15:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It's true that the majority of Bitmapsets are going to be just 1 word,
but it's important to acknowledge that we do suffer in some more
extreme cases when Bitmapsets become large. Partition with large
numbers of partitions is one such case.

Maybe, but optimizing for that while pessimizing every other case
doesn't sound very attractive from here. I think we need some
benchmarks on normal-size bitmapsets before considering doing much
in this area.

After thinking about this again and looking at the code, I'm not
really sure where the pessimism has been added. For the changes made
to say bms_equal(), there was already a branch that checked the nwords
column so that we could find the shorter and longer sets out of the
two input sets. That's been replaced with a comparison of both input
set's nwords, which does not really seem any more expensive. For
bms_compare() we needed to do Min(a->nwords, b->nwords) to find the
shortest set, likewise for bms_nonempty_difference() and
bms_is_subset(). That does not seem less expensive than the
replacement code.

I think the times where we have sets that we do manage to trim down
the nword count with that we actually end up having to expand again
are likely fairly rare.

I also wondered if we could shave off a few instructions by utilising
the knowledge that nwords is never 0. That would mean that some of
the loops could be written as:

i = 0; do { <stuff>; } while (++i < set->nwords);

instead of:

for (i = 0; i < set->nwords; i++) { <stuff>; }

if we do assume that the vast majority of sets are nwords==1 sets,
then this reduces the loop condition checks by half for all those.

I see that gcc manages to optimize: for (i = 0; i < set->nwords || i
== 0; i++) { <stuff>; } into the same code as the do while loop. clang
does not seem to manage that.

Also, if we're going to make any sort of changes here it'd probably
behoove us to make struct Bitmapset private in bitmapset.c, so that
we can have confidence that nobody is playing games with them.
I had a go at that and was pleasantly surprised to find that
actually nobody has; the attached passes check-world. It'd probably
be smart to commit this as a follow-on to 00b41463c, whether or not
we go any further.

That seems like a good idea. This will give us extra reassurance that
nobody is making up their own Bitmapsets through some custom function
that don't follow the rules.

Also, given that we do this, I don't think that check_bitmapset_invariants
as you propose it is worth the trouble.

I wondered if maybe just Assert(a == NULL || a->words[a->nwords - 1]
!= 0); would be worthwhile anywhere. However, I don't see any
particular function that is more likely to catch those errors, so it's
maybe not worth doing anywhere if we're not doing it everywhere.

I adjusted the patch to remove the invariant checks and fixed up a
couple of things I'd missed. The 0002 patch changes the for loops
into do while loops. I wanted to see if we could see any performance
gains from doing this.

The performance numbers are nowhere near as stable as I'd like them to
have been, but testing the patch shows:

Test 1:

setup:
create table t1 (a int) partition by list(a);
select 'create table t1_'||x||' partition of t1 for values
in('||x||');' from generate_series(0,9)x;
\gexec

Test 1's sql:
select * from t1 where a > 1 and a < 3;

for i in {1..3}; do pgbench -n -f test1.sql -T 15 postgres | grep tps; done

master (cf96907aad):
tps = 29534.189309
tps = 30465.722545
tps = 30328.290553

master + 0001:
tps = 28915.174536
tps = 29817.950994
tps = 29387.084581

master + 0001 + 0002:
tps = 29438.216512
tps = 29951.905408
tps = 31445.191414

Test 2:

setup:
create table t2 (a int) partition by list(a);
select 'create table t2_'||x||' partition of t2 for values
in('||x||');' from generate_series(0,9999)x;
\gexec

Test 2's sql:
select * from t2 where a > 1 and a < 3;

for i in {1..3}; do pgbench -n -f test2.sql -T 15 postgres | grep tps; done

master (cf96907aad):
tps = 28470.504990
tps = 29175.450905
tps = 28123.699176

master + 0001:
tps = 28056.256805
tps = 28380.401746
tps = 28384.395217

master + 0001 + 0002:
tps = 29365.992219
tps = 28418.374923
tps = 28303.924129

Test 3:

setup:
create table t3a (a int primary key);
create table t3b (a int primary key);

Test 3's sql:
select * from t3a inner join t3b on t3a.a = t3b.a;

for i in {1..3}; do pgbench -n -f test3.sql -T 15 postgres | grep tps; done

master (cf96907aad):
tps = 20458.710550
tps = 20527.898929
tps = 20284.165277

master + 0001:
tps = 20700.340713
tps = 20571.913956
tps = 20541.771589

master + 0001 + 0002:
tps = 20046.674601
tps = 20016.649536
tps = 19487.999853

I've attached a graph of this too. It shows that there might be a
small increase in performance with tests 1 and 2. It seems like test 3
regresses a bit. I suspect this might just be a code arrangement issue
as master + 0001 is faster than 0001 + 0002 for that test.

David

#14

[1]: /messages/by-id/CAJ2pMkY10J_PA2jpH5M-VoOo6BvJnTOO_-t_znu_pOaP0q10pA@mail.gmail.com
[2]: /messages/by-id/CAApHDvq9eq0W_aFUGrb6ba28ieuQN4zM5Uwqxy7+LMZjJc+VGg@mail.gmail.com

watari.yuya@gmail.com

about 3 years ago

In reply to: David Rowley (#13)

Re: Making empty Bitmapsets always be NULL

Hello,

On Tue, Mar 7, 2023 at 1:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

I adjusted the patch to remove the invariant checks and fixed up a
couple of things I'd missed. The 0002 patch changes the for loops
into do while loops. I wanted to see if we could see any performance
gains from doing this.

I noticed that these patches caused significant degradation while
working on improving planning performance in another thread [1]/messages/by-id/CAJ2pMkY10J_PA2jpH5M-VoOo6BvJnTOO_-t_znu_pOaP0q10pA@mail.gmail.com.

In the experiment, I used the query attached to this email. This
workload consists of eight tables, each of which is split into n
partitions. The "query.sql" measures the planning time of a query that
joins these tables. You can quickly reproduce my experiment using the
following commands.

=====
psql -f create-tables.sql
psql -f query.sql
=====

I show the result in the following tables. I refer to David's patches
in [2]/messages/by-id/CAApHDvq9eq0W_aFUGrb6ba28ieuQN4zM5Uwqxy7+LMZjJc+VGg@mail.gmail.com as the "trailing-zero" patch. When n was large, the
trailing-zero patch showed significant degradation. This is due to too
many calls of repalloc(). With this patch, we cannot reuse spaces
after the last non-zero bitmapword, so we need to call repalloc() more
frequently than before. When n is 256, repalloc() was called only 670
times without the patch, while it was called 5694 times with the
patch.

Table 1: Planning time (ms)
-----------------------------------------------------------------
n | Master | Patched (trailing-zero) | Patched (bitwise-OR)
-----------------------------------------------------------------
1 | 37.639 | 37.330 | 36.979
2 | 36.066 | 35.646 | 36.044
4 | 37.958 | 37.349 | 37.842
8 | 42.397 | 42.994 | 39.779
16 | 54.565 | 67.713 | 44.186
32 | 89.220 | 100.828 | 65.542
64 | 227.854 | 269.059 | 150.398
128 | 896.513 | 1279.965 | 577.671
256 | 4241.994 | 8220.508 | 2538.681
-----------------------------------------------------------------

Table 2: Planning time speedup (higher is better)
------------------------------------------------------
n | Patched (trailing-zero) | Patched (bitwise-OR)
------------------------------------------------------
1 | 100.8% | 101.8%
2 | 101.2% | 100.1%
4 | 101.6% | 100.3%
8 | 98.6% | 106.6%
16 | 80.6% | 123.5%
32 | 88.5% | 136.1%
64 | 84.7% | 151.5%
128 | 70.0% | 155.2%
256 | 51.6% | 167.1%
------------------------------------------------------

On Fri, Mar 3, 2023 at 10:52 AM David Rowley <dgrowleyml@gmail.com> wrote:

The patch also optimizes sub-optimal newly added code which calls
bms_is_empty_internal() when we have other more optimal means to
determine if the set is empty or not.

However, I agree with David's opinion regarding the
bms_is_empty_internal() calls, which is quoted above. I have
implemented this optimization in a slightly different way than David.
My patch is attached to this email. The difference between my patch
and David's is in the determination method of whether the result is
empty: David's patch records the last index of non-zero bitmapword to
minimize the Bitmapset. If the index is -1, we can conclude that the
result is empty. In contrast, my patch uses a more lightweight
operation. I show my changes as follows.

=====
@@ -263,6 +261,7 @@ bms_intersect(const Bitmapset *a, const Bitmapset *b)
const Bitmapset *other;
int resultlen;
int i;
+ bitmapword bitwise_or = 0;

     /* Handle cases where either input is NULL */
     if (a == NULL || b == NULL)
@@ -281,9 +280,17 @@ bms_intersect(const Bitmapset *a, const Bitmapset *b)
     /* And intersect the longer input with the result */
     resultlen = result->nwords;
     for (i = 0; i < resultlen; i++)
-        result->words[i] &= other->words[i];
+    {
+        bitmapword    w = (result->words[i] &= other->words[i]);
+
+        /*
+         * Compute bitwise OR of all bitmapwords to determine if the result
+         * is empty
+         */
+        bitwise_or |= w;
+    }
     /* If we computed an empty result, we must return NULL */
-    if (bms_is_empty_internal(result))
+    if (bitwise_or == 0)
     {
         pfree(result);
         return NULL;
@@ -711,30 +718,6 @@ bms_membership(const Bitmapset *a)
     return result;
 }
=====

My idea is to compute the bitwise OR of all bitmapwords of the result
Bitmapset. The bitwise OR can be represented as a single operation in
the machine code and does not require any conditional branches. If the
bitwise ORed value is zero, we can conclude the result Bitmapset is
empty. The costs related to this operation can be almost negligible;
it is significantly cheaper than calling bms_is_empty_internal() and
less expensive than using a conditional branch such as 'if.'

In the tables above, I called my patch the "bitwise-OR" patch. The
patch is much faster than the master when n is large. Its speed up
reached 167.1%. I think just adopting this optimization is worth
considering.

--
Best regards,
Yuya Watari

#15

watari.yuya@gmail.com

about 3 years ago

In reply to: Yuya Watari (#14)

Re: Making empty Bitmapsets always be NULL

Hello,

On Thu, Mar 16, 2023 at 10:30 AM Yuya Watari <watari.yuya@gmail.com> wrote:

My idea is to compute the bitwise OR of all bitmapwords of the result
Bitmapset. The bitwise OR can be represented as a single operation in
the machine code and does not require any conditional branches. If the
bitwise ORed value is zero, we can conclude the result Bitmapset is
empty. The costs related to this operation can be almost negligible;
it is significantly cheaper than calling bms_is_empty_internal() and
less expensive than using a conditional branch such as 'if.'

After posting the patch, I noticed that my patch had some bugs. My
idea above is not applicable to bms_del_member(), and I missed some
additional operations required in bms_del_members(). I have attached
the fixed version to this email. I really apologize for making the
mistakes. Should we add new regression tests to prevent this kind of
bug?

The following tables illustrate the result of a re-run experiment. The
significant improvement was a mistake, but a speedup of about 2% was
still obtained when the number of partitions, namely n, was large.
This result indicates that the optimization regarding
bms_is_empty_internal() is effective on some workloads.

Table 1: Planning time (ms)
(n: the number of partitions of each table)
-----------------------------------------------------------------
n | Master | Patched (trailing-zero) | Patched (bitwise-OR)
-----------------------------------------------------------------
1 | 36.903 | 36.621 | 36.731
2 | 35.842 | 35.031 | 35.704
4 | 37.756 | 37.457 | 37.409
8 | 42.069 | 42.578 | 42.322
16 | 53.670 | 67.792 | 53.618
32 | 88.412 | 100.605 | 89.147
64 | 229.734 | 271.259 | 225.971
128 | 889.367 | 1272.270 | 870.472
256 | 4209.312 | 8223.623 | 4129.594
-----------------------------------------------------------------

Table 2: Planning time speedup (higher is better)
------------------------------------------------------
n | Patched (trailing-zero) | Patched (bitwise-OR)
------------------------------------------------------
1 | 100.8% | 100.5%
2 | 102.3% | 100.4%
4 | 100.8% | 100.9%
8 | 98.8% | 99.4%
16 | 79.2% | 100.1%
32 | 87.9% | 99.2%
64 | 84.7% | 101.7%
128 | 69.9% | 102.2%
256 | 51.2% | 101.9%
------------------------------------------------------

--
Best regards,
Yuya Watari

#16

watari.yuya@gmail.com

almost 3 years ago

In reply to: Yuya Watari (#15)

Re: Making empty Bitmapsets always be NULL

Hello,

On Tue, Mar 7, 2023 at 1:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

I adjusted the patch to remove the invariant checks and fixed up a
couple of things I'd missed. The 0002 patch changes the for loops
into do while loops. I wanted to see if we could see any performance
gains from doing this.

In March, I reported that David's patch caused a degradation in
planning performance. I have investigated this issue further and found
some bugs in the patch. Due to these bugs, Bitmapset operations in the
original patch computed incorrect results. This incorrect computation
resulted in unexpected behavior, which I observed as performance
degradation. After fixing the bugs, David's patch showed significant
performance improvements. In particular, it is worth noting that the
patch obtained a good speedup even when most Bitmapsets have only one
word.

1.1. Wrong truncation that we should not do (fixed in v2-0003)

The first bug is in bms_difference() and bms_del_members(). At the end
of these functions, the original patch truncated the result Bitmapset
when lastnonzero was -1. However, we must not do this when the result
Bitmapset is longer than the other. In such a case, the last word of
the result was still non-zero, so we cannot shorten nwords. I fixed
this bug in v2-0003.

1.2. Missing truncation that we should do (fixed in v2-0004)

The other bug is in bms_del_member(). As seen from v2-0004-*.patch,
the original patch missed the necessary truncation. I also fixed this
bug.

2. Experiments

I conducted experiments to evaluate the performance of David's patch
with bug fixes. In the experiments, I used two queries attached to
this email. The first query, Query A (query-a.sql), joins three tables
and performs an aggregation. This is quite a simple query. The second
query, Query B (query-b.sql), is more complicated because it joins
eight tables. In both queries, every table is split into n partitions.
I issued these queries with varying n and measured their planning
times. The following tables and attached figure show the results.

Table 1: Planning time and its speedup of Query A
(n: the number of partitions of each table)
(Speedup: higher is better)
---------------------------------------------
n | Master (ms) | Patched (ms) | Speedup
---------------------------------------------
1 | 0.722 | 0.682 | 105.8%
2 | 0.779 | 0.774 | 100.6%
4 | 0.977 | 0.958 | 101.9%
8 | 1.286 | 1.287 | 99.9%
16 | 1.993 | 1.986 | 100.4%
32 | 3.967 | 3.900 | 101.7%
64 | 7.783 | 7.310 | 106.5%
128 | 23.369 | 19.722 | 118.5%
256 | 108.723 | 75.149 | 144.7%
384 | 265.576 | 167.354 | 158.7%
512 | 516.468 | 301.100 | 171.5%
640 | 883.167 | 494.960 | 178.4%
768 | 1423.839 | 755.201 | 188.5%
896 | 2195.935 | 1127.786 | 194.7%
1024 | 3041.131 | 1444.145 | 210.6%
---------------------------------------------

Table 2: Planning time and its speedup of Query B
--------------------------------------------
n | Master (ms) | Patched (ms) | Speedup
--------------------------------------------
1 | 36.038 | 35.455 | 101.6%
2 | 34.831 | 34.178 | 101.9%
4 | 36.537 | 35.998 | 101.5%
8 | 41.234 | 40.333 | 102.2%
16 | 52.427 | 50.596 | 103.6%
32 | 87.064 | 80.013 | 108.8%
64 | 228.050 | 187.762 | 121.5%
128 | 886.140 | 645.731 | 137.2%
256 | 4212.709 | 2853.072 | 147.7%
--------------------------------------------

You can quickly reproduce my experiments by the following commands.

== Query A ==
psql -f create-tables-a.sql
psql -f query-a.sql
=============

== Query B ==
psql -f create-tables-b.sql
psql -f query-b.sql
=============

The above results indicate that David's patch demonstrated outstanding
performance. The speedup reached 210.6% for Query A and 147.7% for
Query B. Even when n is small, the patch reduced planning time. The
main concern about this patch was overheads for Bitmapsets with only
one or two words. My experiments imply that such overheads are
non-existent or negligible because some performance improvements were
obtained even for small sizes.

The results of my experiments strongly support the effectiveness of
David's patch. I think this optimization is worth considering.

I am looking forward to your comments.

--
Best regards,
Yuya Watari

#17

dgrowleyml@gmail.com

almost 3 years ago

In reply to: Yuya Watari (#16)

Re: Making empty Bitmapsets always be NULL

On Tue, 13 Jun 2023 at 00:32, Yuya Watari <watari.yuya@gmail.com> wrote:

In March, I reported that David's patch caused a degradation in
planning performance. I have investigated this issue further and found
some bugs in the patch. Due to these bugs, Bitmapset operations in the
original patch computed incorrect results. This incorrect computation
resulted in unexpected behavior, which I observed as performance
degradation. After fixing the bugs, David's patch showed significant
performance improvements. In particular, it is worth noting that the
patch obtained a good speedup even when most Bitmapsets have only one
word.

Thank you for looking at this again and finding and fixing the two
bugs and running some benchmarks.

I've incorporated fixes for the bugs in the attached patch. I didn't
quite use the same approach as you did. I did the fix for 0003
slightly differently and added two separate paths. We've no need to
track the last non-zero word when 'a' has more words than 'b' since we
can't truncate any zero-words off for that case. Not having to do
that makes the do/while loop pretty tight.

For the fix in the 0004 patch, I think we can do what you did more
simply. I don't think there's any need to perform the loop to find
the last non-zero word. We're only deleting a member from a single
word here, so we only need to check if that word is the last word and
remove it if it's become zero. If it's not the last word then we
can't remove it as there must be some other non-zero word after it.

I also made a small adjustment to bms_get_singleton_member() and
bms_singleton_member() to have them Assert fail if result is < 0 after
looping over the set. This should no longer happen so I thought it
would make more compact code if that check was just removed. We'd
likely do better if we got reports of Assert failures here than, in
the case of bms_get_singleton_member, some code accidentally doing the
wrong thing based on a corrupt Bitmapset.

David

#18

watari.yuya@gmail.com

almost 3 years ago

In reply to: David Rowley (#17)

Re: Making empty Bitmapsets always be NULL

Hello,

On Tue, Jun 13, 2023 at 8:07 PM David Rowley <dgrowleyml@gmail.com> wrote:

I've incorporated fixes for the bugs in the attached patch. I didn't
quite use the same approach as you did. I did the fix for 0003
slightly differently and added two separate paths. We've no need to
track the last non-zero word when 'a' has more words than 'b' since we
can't truncate any zero-words off for that case. Not having to do
that makes the do/while loop pretty tight.

I really appreciate your quick response and incorporating those fixes
into your patch. The fix for 0003 looks good to me. I believe your
change improves performance more.

For the fix in the 0004 patch, I think we can do what you did more
simply. I don't think there's any need to perform the loop to find
the last non-zero word. We're only deleting a member from a single
word here, so we only need to check if that word is the last word and
remove it if it's become zero. If it's not the last word then we
can't remove it as there must be some other non-zero word after it.

If my thinking is correct, the do-while loop I added is still
necessary. Consider the following code. The Assertion in this code
passes in the master but fails in the new patch.

=====
Bitmapset *x = bms_make_singleton(1000);

x = bms_del_member(x, 1000);
Assert(x == NULL);
=====

In the code above, we get a new Bitmapset by bms_make_singleton(1000).
This Bitmapset has many words. Only the last word is non-zero, and all
the rest are zero. If we call bms_del_member(x, 1000) for the
Bitmapset, all words of the result will be zero, including the last
word, so we must return NULL. However, the new patch truncates only
the last word, leading to an incorrect result. Therefore, we need to
perform the loop to find the actual non-zero word after the deletion.
Of course, I agree that if we are not modifying the last word, we
don't have to truncate anything, so we can omit the loop.

I also made a small adjustment to bms_get_singleton_member() and
bms_singleton_member() to have them Assert fail if result is < 0 after
looping over the set. This should no longer happen so I thought it
would make more compact code if that check was just removed. We'd
likely do better if we got reports of Assert failures here than, in
the case of bms_get_singleton_member, some code accidentally doing the
wrong thing based on a corrupt Bitmapset.

I agree with your change. I think failing by Assertion is better than
a runtime error or unexpected behavior.

--
Best regards,
Yuya Watari

#19