[PoC] Reducing planning time when tables have many partitions

Started by Yuya Watari · 143 messages
#1 Yuya Watari
watari.yuya@gmail.com

Hello,

I found that planning takes too much time when tables have many child
partitions. According to my observation, the planning time grows in
the order of O(n^2), where n is the number of child partitions. I have
attached a patch that addresses this problem. Please note that this
patch is a PoC.

1. Problem Statement

The problem arises with the following simple query. This query is
modeled after a university's grade system, joining tables of students,
scores, and GPAs to output an academic record for each student.

=====
SELECT students.name, gpas.gpa AS gpa, sum(scores.score) AS total_score
FROM students, scores, gpas
WHERE students.id = scores.student_id AND students.id = gpas.student_id
GROUP BY students.id, gpas.student_id;
=====

Since many students are enrolled in the university, we partition each
table. The planning time of the above query then grows very rapidly as
the number of partitions increases.

I conducted an experiment by varying the number of partitions of three
tables (students, scores, and gpas) from 2 to 1024. The attached
figure illustrates the result. The blue line annotated with "master"
stands for the result on the master branch. Obviously, its
computational complexity is large.

I attached SQL files to this e-mail as "sample-queries.zip". You can
reproduce my experiment with the following steps:
=====
$ unzip sample-queries.zip
$ cd sample-queries
# Create tables and insert sample data ('n' denotes the number of partitions)
$ psql -f create-table-n.sql
# Measure planning time
$ psql -f query-n.sql
=====

2. Where Is It Slow?

In order to identify bottlenecks, I ran a performance profiler (perf).
The attached "perf-master.png" shows a call graph of the planning of
query-1024.sql.

From this figure, it can be seen that "bms_equal" and "bms_is_subset"
take up most of the running time. Most of these calls occur while
enumerating the EquivalenceMembers of an EquivalenceClass. These
enumerations live in src/backend/optimizer/path/equivclass.c and have
the following form.

=====
EquivalenceClass *ec = /* given */;

EquivalenceMember *em;
ListCell   *lc;

foreach(lc, ec->ec_members)
{
    em = (EquivalenceMember *) lfirst(lc);

    /* predicate is bms_equal or bms_is_subset, etc. */
    if (!predicate(em))
        continue;

    /* the predicate is satisfied */
    do something...;
}
=====

This foreach loop is a linear search, whose cost will become very high
when there are many EquivalenceMembers in ec_members. This is the case
when the number of partitions is large. Eliminating this heavy linear
search is a key to improving planning performance.

3. How to Solve?

In my patch, I made three different optimizations depending on the
predicate pattern.

3.1 When the predicate is "!em->em_is_child"

In equivclass.c, several operations are performed only on
EquivalenceMembers whose em_is_child is false. Even if a table has
many partitions, the number of non-child EquivalenceMembers is
limited. Therefore, it is useful to keep the non-child members in a
separate list in advance.

My patch adds the "ec_not_child_members" field to EquivalenceClass.
This field is a List containing non-child members. Taking advantage of
this, the previous loop can be rewritten as follows:

=====
foreach(lc, ec->ec_not_child_members)
{
    em = (EquivalenceMember *) lfirst(lc);
    Assert(!em->em_is_child);
    do something...;
}
=====

3.2 When the predicate is "bms_equal(em->em_relids, relids)"

"bms_equal" is another common predicate. In this case, an operation is
performed when "em_relids" matches a given Relids.

This type of loop can be handled quickly with a hash table. First,
group EquivalenceMembers that have the same Relids into a list. Then,
create an associative array whose key is the Relids and whose value is
that list. My patch adds the "ec_members_htab" field to
EquivalenceClass, which serves as this associative array.

Based on this idea, the previous loop is transformed as follows. Here,
the FindEcMembersMatchingRelids function looks up the hash table and
returns the corresponding value, which is a list.
=====
foreach(lc, FindEcMembersMatchingRelids(ec, relids))
{
    em = (EquivalenceMember *) lfirst(lc);
    Assert(bms_equal(em->em_relids, relids));
    do something...;
}
=====
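The FindEcMembersMatchingRelids() lookup above can be sketched outside
of PostgreSQL as a small chained hash table. The sketch below is
purely illustrative and is not the patch's code: Relids is modelled as
a plain uint64_t bitmask, and the struct names, hash function, and
bucket count are all invented for the example. The real patch would
key a PostgreSQL hash table on Bitmapset values instead.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64             /* power of two, so we can mask */

typedef struct Member
{
    uint64_t        relids;     /* stand-in for em->em_relids */
    int             id;         /* payload, e.g. a member index */
    struct Member  *next;       /* chain of members in the same bucket */
} Member;

static Member *buckets[NBUCKETS];

static unsigned
hash_relids(uint64_t relids)
{
    /* any decent integer mixer works; this is splitmix64's finalizer */
    relids ^= relids >> 30; relids *= UINT64_C(0xbf58476d1ce4e5b9);
    relids ^= relids >> 27; relids *= UINT64_C(0x94d049bb133111eb);
    relids ^= relids >> 31;
    return (unsigned) (relids & (NBUCKETS - 1));
}

static void
add_member(Member *m)
{
    unsigned    h = hash_relids(m->relids);

    m->next = buckets[h];
    buckets[h] = m;
}

/*
 * Analogue of FindEcMembersMatchingRelids(): return the next member whose
 * relids compare equal to the key, without scanning all members. Pass
 * NULL to start, or the previously returned member to continue.
 */
static Member *
find_members_matching_relids(uint64_t relids, Member *prev)
{
    Member     *m = prev ? prev->next : buckets[hash_relids(relids)];

    for (; m != NULL; m = m->next)
        if (m->relids == relids)    /* bms_equal() analogue */
            return m;
    return NULL;
}
```

Each bms_equal-style search then costs one hash probe plus the length
of one bucket chain, rather than a scan of every member of the class.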

3.3 When the predicate is "bms_is_subset(em->em_relids, relids)"

Several operations are performed on EquivalenceMembers whose em_relids
is a subset of a given "relids"; here the predicate is
"bms_is_subset". Optimizing this search is not as easy as with
bms_equal, but the hash-table technique above can still be applied.

If the "relids" has m elements, there are 2^m subsets to consider. The
key observation is that m is small in most cases. For example, m is at
most 3 in the sample query, so the number of subsets is at most
2^3 = 8. Therefore, we can enumerate all subsets in a realistic time,
and looking up the hash table with each subset as a key drastically
reduces unnecessary searches. My patch's optimization is based on this
observation.

This technique is illustrated by the following pseudo-code. The code
iterates over all subsets and looks up the corresponding
EquivalenceMembers in the hash table. The actual code is more
complicated for performance reasons.

=====
EquivalenceClass *ec = /* given */;
Relids      relids = /* given */;

int         num_members_in_relids = bms_num_members(relids);

for (int bit = 0; bit < (1 << num_members_in_relids); bit++)
{
    EquivalenceMember *em;
    ListCell   *lc;
    Relids      subset = construct subset from 'bit';

    foreach(lc, FindEcMembersMatchingRelids(ec, subset))
    {
        em = (EquivalenceMember *) lfirst(lc);
        Assert(bms_is_subset(em->em_relids, relids));
        do something...;
    }
}
=====
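The "construct subset from 'bit'" step can be avoided entirely by
walking the submasks of the relids directly with the standard
submask-enumeration trick. The sketch below is illustrative only (not
the patch's code) and again models Relids as a plain uint64_t; for a
mask with m bits set it visits exactly 2^m subsets, including the
empty set and the mask itself.

```c
#include <stdint.h>

/*
 * Call 'visit' once per subset of 'relids' (each subset would then be
 * used as a hash-table key); return how many subsets were visited.
 */
static int
for_each_subset(uint64_t relids,
                void (*visit) (uint64_t subset, void *arg),
                void *arg)
{
    uint64_t    sub = relids;
    int         count = 0;

    for (;;)
    {
        if (visit)
            visit(sub, arg);    /* e.g. probe the hash table with 'sub' */
        count++;
        if (sub == 0)
            break;
        /* standard step: the next smaller submask of 'relids' */
        sub = (sub - 1) & relids;
    }
    return count;
}
```

For example, a relids mask with 3 bits set yields 8 hash probes, which
matches the 2^3 = 8 figure above.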

4. Experimental Result

The red line in the attached figure is the planning time with my
patch. The chart indicates that planning performance has been greatly
improved. The exact values are shown below.

Planning time of "query-n.sql" (n = number of partitions):

    n | Master (s) | Patched (s) | Speed up
------+------------+-------------+----------
    2 |      0.003 |       0.003 |     0.9%
    4 |      0.004 |       0.004 |     1.0%
    8 |      0.006 |       0.006 |     4.6%
   16 |      0.011 |       0.010 |     5.3%
   32 |      0.017 |       0.016 |     4.7%
   64 |      0.032 |       0.030 |     8.0%
  128 |      0.073 |       0.060 |    17.7%
  256 |      0.216 |       0.142 |    34.2%
  384 |      0.504 |       0.272 |    46.1%
  512 |      0.933 |       0.462 |    50.4%
  640 |      1.529 |       0.678 |    55.7%
  768 |      2.316 |       1.006 |    56.6%
  896 |      3.280 |       1.363 |    58.5%
 1024 |      4.599 |       1.770 |    61.5%

With 1024 partitions, the planning time was reduced by 61.5%.
Moreover, even with 128 partitions, which is a realistic use case, the
performance improved by 17.7%.

5. Things to Be Discussed

5.1 Regressions

While my approach is effective for tables with a large number of
partitions, it may cause performance degradation otherwise. For small
cases, we need to fall back to the conventional algorithm, but the
appropriate threshold is not obvious.

5.2 Enumeration order

My patch may change the order in which members are enumerated. This
affects generated plans.

5.3 Code Quality

Source code quality should be improved.

=====

Again, I posted this patch as a PoC. I would appreciate it if you
would discuss the effectiveness of these optimizations with me.

Best regards,
Yuya Watari

Attachments:

* v1-reducing-planning-time-when-tables-have-many-partitions.patch (application/octet-stream)
* figure.png (image/png)
* sample-queries.zip (application/x-zip-compressed)
* perf-master.png (image/png)
* perf-patched.png (image/png)
#2 David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#1)
Re: [PoC] Reducing planning time when tables have many partitions

On Fri, 18 Mar 2022 at 23:32, Yuya Watari <watari.yuya@gmail.com> wrote:

> I found a problem that planning takes too much time when the tables
> have many child partitions. According to my observation, the planning
> time increases in the order of O(n^2). Here, n is the number of child
> partitions. I attached the patch to solve this problem. Please be
> noted that this patch is a PoC.
>
> 3. How to Solve?

I think a better way to solve this would be just to have a single hash
table over all EquivalenceClasses that allows fast lookups of
EquivalenceMember->em_expr. I think there's no reason that a given
Expr should appear in more than one non-merged EquivalenceClass. The
EquivalenceClass a given Expr belongs to would need to be updated
during the merge process.

For functions such as get_eclass_for_sort_expr() and
process_equivalence(), that would become a fairly fast hashtable
lookup instead of having nested loops to find if an EquivalenceMember
already exists for the given Expr. We might not want to build the hash
table for all queries. Maybe we could just do it if we get to
something like ~16 EquivalenceMember in total.

As of now, we don't have any means to hash Exprs, so all that
infrastructure would need to be built first. Peter Eisentraut is
working on a patch [1] which is a step towards having this.

Here's a simple setup to show the pain of this problem:

create table lp (a int, b int) partition by list(a);
select 'create table lp'||x::text||' partition of lp for values in('||x::text||');'
from generate_series(0,4095) x;
\gexec
explain analyze select * from lp where a=b order by a;

Planning Time: 510.248 ms
Execution Time: 264.659 ms

David

[1]: https://commitfest.postgresql.org/37/3182/

#3 Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#2)
Re: [PoC] Reducing planning time when tables have many partitions

Dear David,

Thank you for your comments on my patch. I really apologize for my
late response.

On Thu, Mar 24, 2022 at 11:03 AM David Rowley <dgrowleyml@gmail.com> wrote:

> I think a better way to solve this would be just to have a single hash
> table over all EquivalenceClasses that allows fast lookups of
> EquivalenceMember->em_expr. I think there's no reason that a given
> Expr should appear in more than one non-merged EquivalenceClass. The
> EquivalenceClass a given Expr belongs to would need to be updated
> during the merge process.

Thank you for your idea. However, I think building a hash table whose
key is EquivalenceMember->em_expr does not work for this case.

What I am trying to optimize in this patch is the following code.

=====
EquivalenceClass *ec = /* given */;

EquivalenceMember *em;
ListCell   *lc;

foreach(lc, ec->ec_members)
{
    em = (EquivalenceMember *) lfirst(lc);

    /* predicate is bms_equal or bms_is_subset, etc. */
    if (!predicate(em))
        continue;

    /* the predicate is satisfied */
    do something...;
}
=====

From my observation, these predicates are false in most cases, so the
subsequent processing is not executed. My optimization is based on
this observation and uses hash tables to eliminate the predicate
calls.

If the predicate were "em->em_expr == something", a hash table keyed
by em_expr would be effective. However, the actual predicates are not
of that form; they are the following.

// Find EquivalenceMembers whose relids is equal to the given relids
(1) bms_equal(em->em_relids, relids)

// Find EquivalenceMembers whose relids is a subset of the given relids
(2) bms_is_subset(em->em_relids, relids)

Since these predicates match on em_relids rather than em_expr, we need
to build a hash table keyed by em_relids. With such a table, we can
drastically reduce the planning time for pattern (1). Furthermore, by
enumerating all subsets of relids, pattern (2) can also be optimized.
The detailed algorithm is described in my first email.

Here is an example of pattern (1). The following code is from
src/backend/optimizer/path/equivclass.c. As can be seen, the foreach
loop tries to find an EquivalenceMember whose cur_em->em_relids is
equal to rel->relids. If one is found, subsequent processing is
performed.

== Before patched ==
List *
generate_implied_equalities_for_column(PlannerInfo *root,
                                       RelOptInfo *rel,
                                       ec_matches_callback_type callback,
                                       void *callback_arg,
                                       Relids prohibited_rels)
{
    ...

    EquivalenceClass *cur_ec = (EquivalenceClass *)
        list_nth(root->eq_classes, i);
    EquivalenceMember *cur_em;
    ListCell   *lc2;

    cur_em = NULL;
    foreach(lc2, cur_ec->ec_members)
    {
        cur_em = (EquivalenceMember *) lfirst(lc2);
        if (bms_equal(cur_em->em_relids, rel->relids) &&
            callback(root, rel, cur_ec, cur_em, callback_arg))
            break;
        cur_em = NULL;
    }

    if (!cur_em)
        continue;

    ...
}
===

My patch modifies this code as follows. em_foreach_relids_equals is a
newly defined macro that iterates over the EquivalenceMembers
satisfying the bms_equal condition by looking up a hash table with
rel->relids as the key. This type of optimization cannot be achieved
without a hash table keyed by em->em_relids.

== After patched ==
List *
generate_implied_equalities_for_column(PlannerInfo *root,
                                       RelOptInfo *rel,
                                       ec_matches_callback_type callback,
                                       void *callback_arg,
                                       Relids prohibited_rels)
{
    ...

    EquivalenceClass *cur_ec = (EquivalenceClass *)
        list_nth(root->eq_classes, i);
    EquivalenceMember *cur_em;
    EquivalenceMember *other_em;

    cur_em = NULL;
    em_foreach_relids_equals(cur_em, cur_ec, rel->relids)
    {
        Assert(bms_equal(cur_em->em_relids, rel->relids));
        if (callback(root, rel, cur_ec, cur_em, callback_arg))
            break;
        cur_em = NULL;
    }

    if (!cur_em)
        continue;

    ...
}
===

> We might not want to build the hash table for all queries.

I agree with you. Building many hash tables would consume a lot of
memory. My idea for this problem is to let the hash table's key be a
pair of EquivalenceClass and Relids, although this approach may
increase the lookup time.

==========

I noticed that the previous patch does not work with the current HEAD.
I attached the modified one to this email.

Additionally, I added my patch to the current commit fest [1].
[1]: https://commitfest.postgresql.org/38/3701/

--
Best regards,
Yuya Watari

Attachments:

* v2-reducing-planning-time-when-tables-have-many-partitions.patch (application/x-patch)
#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yuya Watari (#3)
Re: [PoC] Reducing planning time when tables have many partitions

Yuya Watari <watari.yuya@gmail.com> writes:

> On Thu, Mar 24, 2022 at 11:03 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> I think a better way to solve this would be just to have a single hash
>> table over all EquivalenceClasses that allows fast lookups of
>> EquivalenceMember->em_expr.

> If the predicate were "em->em_expr == something", the hash table whose
> key is em_expr would be effective. However, the actual predicates are
> not of this type but the following.

> // Find EquivalenceMembers whose relids is equal to the given relids
> (1) bms_equal(em->em_relids, relids)

> // Find EquivalenceMembers whose relids is a subset of the given relids
> (2) bms_is_subset(em->em_relids, relids)

Yeah, that's a really interesting observation, and I agree that
David's suggestion doesn't address it. Maybe after we fix this
problem, matching of em_expr would be the next thing to look at,
but your results say it isn't the first thing.

I'm not real thrilled with trying to throw hashtables at the problem,
though. As David noted, they'd be counterproductive for simple
queries. Sure, we could address that with duplicate code paths,
but that's a messy and hard-to-tune approach. Also, I find the
idea of hashing on all subsets of relids to be outright scary.
"m is not so large in most cases" does not help when m *is* large.

For the bms_equal class of lookups, I wonder if we could get anywhere
by adding an additional List field to every RelOptInfo that chains
all EquivalenceMembers that match that RelOptInfo's relids.
The trick here would be to figure out when to build those lists.
The simple answer would be to do it lazily on-demand, but that
would mean a separate scan of all the EquivalenceMembers for each
RelOptInfo; I wonder if there's a way to do better?

Perhaps the bms_is_subset class could be handled in a similar
way, ie do a one-time pass to make a List of all EquivalenceMembers
that use a RelOptInfo.

regards, tom lane

#5 Yuya Watari
watari.yuya@gmail.com
In reply to: Tom Lane (#4)
Re: [PoC] Reducing planning time when tables have many partitions

Dear Tom,

Thank you for replying to my email.

On Mon, Jul 4, 2022 at 6:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I'm not real thrilled with trying to throw hashtables at the problem,
> though. As David noted, they'd be counterproductive for simple
> queries.

As you said, my approach using hash tables has some overhead, which
can degrade query planning time.

I measured this degradation in a brief experiment, using the simple
query shown below.

===
SELECT students.name, gpas.gpa AS gpa, sum(scores.score) AS total_score
FROM students, scores, gpas
WHERE students.id = scores.student_id AND students.id = gpas.student_id
GROUP BY students.id, gpas.student_id;
===

Here, the students, scores, and gpas tables have no partitions, i.e.,
they are regular tables. Therefore, my techniques do not apply to this
query and may instead cause some regression. I repeatedly issued this
query 1 million times and measured the planning time of each run.

The attached figure shows the distribution of planning times. It
indicates that my patch has no severe negative impact on planning
performance, although there seems to be a slight degradation.

The mean and median planning times are shown below. With my patch, the
planning time became 0.002-0.004 milliseconds slower. We have to deal
with this problem, but reducing the time complexity while keeping the
degradation at zero is significantly challenging.

Planning time (ms)
             | Mean  | Median
-------------+-------+--------
 Master      | 0.682 | 0.674
 Patched     | 0.686 | 0.676
-------------+-------+--------
 Degradation | 0.004 | 0.002

Of course, the attached result is just an example. Significant
regression might occur in other types of queries.

> For the bms_equal class of lookups, I wonder if we could get anywhere
> by adding an additional List field to every RelOptInfo that chains
> all EquivalenceMembers that match that RelOptInfo's relids.
> The trick here would be to figure out when to build those lists.
> The simple answer would be to do it lazily on-demand, but that
> would mean a separate scan of all the EquivalenceMembers for each
> RelOptInfo; I wonder if there's a way to do better?
>
> Perhaps the bms_is_subset class could be handled in a similar
> way, ie do a one-time pass to make a List of all EquivalenceMembers
> that use a RelOptInfo.

Thank you for giving your idea. I will try to polish up my algorithm
based on your suggestion.

--
Best regards,
Yuya Watari

Attachments:

* figure.png (image/png)
#6 Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#5)
Re: [PoC] Reducing planning time when tables have many partitions

On 7/5/22 13:57, Yuya Watari wrote:

> On Mon, Jul 4, 2022 at 6:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps the bms_is_subset class could be handled in a similar
>> way, ie do a one-time pass to make a List of all EquivalenceMembers
>> that use a RelOptInfo.
>
> Thank you for giving your idea. I will try to polish up my algorithm
> based on your suggestion.

This work is of significant interest for highly partitioned
configurations. Are you still working on this patch? Given the current
state of the thread, it may be better to change the status to 'Waiting
on Author' until the next version.
Feel free to revert the status if you need more feedback.

--
Regards
Andrey Lepikhov
Postgres Professional

#7 Yuya Watari
watari.yuya@gmail.com
In reply to: Andrei Lepikhov (#6)
Re: [PoC] Reducing planning time when tables have many partitions

Dear Andrey Lepikhov,

Thank you for replying and being a reviewer for this patch. I really
appreciate it.

> Are you still working on this patch?

Yes, I’m working on improving this patch. It is not easy to address
the problems that this patch has, but I’m hoping to send a new version
of it in a few weeks.

--
Best regards,
Yuya Watari

#8 David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#4)
Re: [PoC] Reducing planning time when tables have many partitions

On Mon, 4 Jul 2022 at 09:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> For the bms_equal class of lookups, I wonder if we could get anywhere
> by adding an additional List field to every RelOptInfo that chains
> all EquivalenceMembers that match that RelOptInfo's relids.
> The trick here would be to figure out when to build those lists.
> The simple answer would be to do it lazily on-demand, but that
> would mean a separate scan of all the EquivalenceMembers for each
> RelOptInfo; I wonder if there's a way to do better?

How about, instead of EquivalenceClass having a List field named
ec_members, it has a Bitmapset field named ec_member_indexes and we
just keep a List of all EquivalenceMembers in PlannerInfo and mark
which ones are in the class by setting the bit in the class's
ec_member_indexes field.

That would be teamed up with a new eclass_member_indexes field in
RelOptInfo to store the index into PlannerInfo's List of
EquivalenceMembers that belong to the given RelOptInfo.

For searching:

* To get all EquivalenceMembers in an EquivalenceClass, do a
  bms_next_member loop over the EC's ec_member_indexes field.
* To get all EquivalenceMembers for a given RelOptInfo, do a
  bms_next_member loop over the RelOptInfo's eclass_member_indexes
  field.
* To get all EquivalenceMembers for a given EquivalenceClass and
  RelOptInfo, bms_intersect() the rel's eclass_member_indexes with the
  EC's ec_member_indexes.

I'm unsure if we'd want to bms_union the RelOptInfo's
ec_member_indexes field for join rels. Looking at
get_eclass_indexes_for_relids() we didn't do it that way for
eclass_indexes. Maybe that's because we're receiving RelIds in a few
places without a RelOptInfo.

Certainly, the CPU cache locality is not going to be as good as if we
had a List with all elements together, but for simple queries, there's
not going to be many EquivalenceClasses anyway, and for complex
queries, this should be a win.

David

#9 Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#8)
Re: [PoC] Reducing planning time when tables have many partitions

Dear David,

Thank you for sharing your new idea.

I agree that introducing a Bitmapset field may solve this problem. I
will try this approach in addition to previous ones.

Thank you again for helping me.

--
Best regards,
Yuya Watari

#10 David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#9)
Re: [PoC] Reducing planning time when tables have many partitions

On Wed, 27 Jul 2022 at 18:07, Yuya Watari <watari.yuya@gmail.com> wrote:

> I agree that introducing a Bitmapset field may solve this problem. I
> will try this approach in addition to previous ones.

I've attached a very half-done patch that might help you get started
on this. There are still 2 failing regression tests which seem to be
due to plan changes. I didn't expend any effort looking into why these
plans changed.

The attached does not contain any actual optimizations to find the
minimal set of EMs to loop through by masking the Bitmapsets that I
mentioned in my post last night. I just quickly put it together to
see if there's some hole in the idea. I don't think there is.

I've not really considered all of the places that we'll want to do the
bit twiddling to get the minimal set of EquivalenceMember. I did see
there's a couple more functions in postgres_fdw.c that could be
optimized.

One thing I've only partially thought about is what if you want to
also find EquivalenceMembers with a constant value. If there's a
Const, then you'll lose the bit for that when you mask the ec's
ec_member_indexes with the RelOptInfos. If there are some places
where we need to keep those then I think we'll need to add another
field to EquivalenceClass to mark the index into PlannerInfo's
eq_members for the EquivalenceMember with the Const. That bit would
have to be bms_add_member()ed back into the Bitmapset of matching
EquivalenceMembers after masking out RelOptInfo's ec_member_indexes.

When adding the optimizations to find the minimal set of EM bits to
search through, you should likely add some functions similar to the
get_eclass_indexes_for_relids() and get_common_eclass_indexes()
functions to help you find the minimal set of bits. You can also
probably get some other inspiration from [1], in general.

David

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3373c715535

Attachments:

* eclass_member_speedup.patch (text/plain)
#11 Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#10)
Re: [PoC] Reducing planning time when tables have many partitions

Hello,

On Thu, Jul 28, 2022 at 6:35 AM David Rowley <dgrowleyml@gmail.com> wrote:

> I've attached a very half-done patch that might help you get started
> on this.

Thank you so much for creating the patch. I have implemented your
approach and attached a new version of the patch to this email.

If you have already applied David's patch, please start the 'git am'
command from 0002-Fix-bugs.patch. All regression tests passed with
this patch on my environment.

1. Optimizations

The new optimization techniques utilizing Bitmapsets are implemented
as the following functions in src/include/optimizer/paths.h.

* get_eclass_members_indexes_for_relids()
* get_eclass_members_indexes_for_not_children()
* get_eclass_members_indexes_for_relids_or_not_children()
* get_eclass_members_indexes_for_subsets_of_relids()
* get_eclass_members_indexes_for_subsets_of_relids_or_not_children()
// I think the names of these functions need to be reconsidered.

These functions intersect ec->ec_member_indexes and some Bitmapset and
return indexes of EquivalenceMembers that we want to get.

The implementation of the first three functions listed above is
simple. However, the remaining functions, which handle the
bms_is_subset() condition, are a bit more complicated. I have
optimized this case based on Tom's idea. The detailed steps are as
follows.

I. Intersect ec->ec_member_indexes and the Bitmapset in RelOptInfo.
This intersection set is a candidate for the EquivalenceMembers to be
retrieved.
II. Remove from the candidate set the members that do not satisfy the
bms_is_subset().
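Steps I and II above can be sketched as follows. This is an
illustrative stand-in, not the patch's code: Bitmapsets and Relids are
modelled as uint64_t, is_subset() plays the role of bms_is_subset(),
and __builtin_ctzll assumes a GCC/Clang compiler.

```c
#include <stdint.h>

/* Step II's predicate: a bms_is_subset(a, b) analogue for bitmasks. */
static int
is_subset(uint64_t a, uint64_t b)
{
    return (a & ~b) == 0;
}

/*
 * Return, as a bitmask of indexes, the EquivalenceMembers whose relids
 * are a subset of 'relids'. em_relids[i] holds the relids of member i.
 */
static uint64_t
matching_member_indexes(const uint64_t *em_relids,
                        uint64_t ec_member_indexes,
                        uint64_t rel_member_indexes,
                        uint64_t relids)
{
    /* Step I: intersect the EC's and the rel's member-index sets. */
    uint64_t    candidates = ec_member_indexes & rel_member_indexes;
    uint64_t    result = 0;

    /* Step II: keep only candidates that pass the exact subset test. */
    while (candidates != 0)
    {
        int         i = __builtin_ctzll(candidates);    /* lowest index */

        if (is_subset(em_relids[i], relids))
            result |= UINT64_C(1) << i;
        candidates &= candidates - 1;   /* clear the lowest set bit */
    }
    return result;
}
```

The intersection in step I is what keeps the step II predicate from
being evaluated against every member of the class.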

Optimization for EquivalenceMembers with a constant value is one of
the future works.

2. Experimental Results

I conducted an experiment by using the original query, which is
attached to this email. You can reproduce this experiment by the
following commands.

=====
psql -f create-tables.sql
psql -f query.sql
=====

The following table and the attached figure describe the experimental result.

Planning time of "query.sql" (n = the number of partitions)
----------------------------------------------------------------
    n | Master (ms) | Patched (ms) | Speedup (%) | Speedup (ms)
----------------------------------------------------------------
    1 |       0.809 |        0.760 |       6.09% |        0.049
    2 |       0.799 |        0.811 |      -1.53% |       -0.012
    4 |       1.022 |        0.989 |       3.20% |        0.033
    8 |       1.357 |        1.325 |       2.32% |        0.032
   16 |       2.149 |        2.026 |       5.69% |        0.122
   32 |       4.357 |        3.925 |       9.91% |        0.432
   64 |       9.543 |        7.543 |      20.96% |        2.000
  128 |      27.195 |       15.823 |      41.82% |       11.372
  256 |     130.207 |       52.664 |      59.55% |       77.542
  384 |     330.642 |      112.324 |      66.03% |      218.318
  512 |     632.009 |      197.957 |      68.68% |      434.052
  640 |    1057.193 |      306.861 |      70.97% |      750.333
  768 |    1709.914 |      463.628 |      72.89% |     1246.287
  896 |    2531.685 |      738.827 |      70.82% |     1792.858
 1024 |    3516.592 |      858.211 |      75.60% |     2658.381
----------------------------------------------------------------

-------------------------------------------------------
    n | Stddev of Master (ms) | Stddev of Patched (ms)
-------------------------------------------------------
    1 |                 0.085 |                  0.091
    2 |                 0.061 |                  0.091
    4 |                 0.153 |                  0.118
    8 |                 0.203 |                  0.107
   16 |                 0.150 |                  0.153
   32 |                 0.313 |                  0.242
   64 |                 0.411 |                  0.531
  128 |                 1.263 |                  1.109
  256 |                 5.592 |                  4.714
  384 |                17.423 |                  6.625
  512 |                20.172 |                  7.188
  640 |                40.964 |                 26.246
  768 |                61.924 |                 31.741
  896 |                66.481 |                 27.819
 1024 |                80.950 |                 49.162
-------------------------------------------------------

The speedup with the new patch was up to 75.6%, or 2.7 seconds. The
patch achieved a 21.0% improvement even with 64 partitions, which is a
realistic size. We can conclude that this optimization is very
effective for workloads with highly partitioned tables.

Performance degradation occurred only with 2 partitions, where it was
1.53%, or 12 microseconds. This figure is the difference between the
average planning times of 10,000 runs, and the standard deviations far
exceed the difference in the averages, so it is unclear whether this
degradation is real or just noise.

=====

I'm looking forward to your comments.

--
Best regards,
Yuya Watari

Attachments:

* 0002-Fix-bugs.patch (application/octet-stream)
* 0003-Implement-optimizations-based-on-Bitmapset.patch (application/octet-stream)
* 0001-EClass-member-speedup.patch (application/octet-stream)
* figure.png (image/png)
* query.sql (application/octet-stream)
* create-tables.sql (application/octet-stream)
#12 David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#11)
Re: [PoC] Reducing planning time when tables have many partitions

On Mon, 8 Aug 2022 at 23:28, Yuya Watari <watari.yuya@gmail.com> wrote:

> If you have already applied David's patch, please start the 'git am'
> command from 0002-Fix-bugs.patch. All regression tests passed with
> this patch on my environment.

Thanks for fixing those scope bugs.

In regards to the 0002 patch, you have;

+ * TODO: "bms_add_members(ec1->ec_member_indexes, ec2->ec_member_indexes)"
+ * did not work to combine two EquivalenceClasses. This is probably because
+ * the order of the EquivalenceMembers is different from the previous
+ * implementation, which added the ec2's EquivalenceMembers to the end of
+ * the list.

as far as I can see, the reason the code that I wrote caused the
following regression test failure;

-         Index Cond: ((ff = '42'::bigint) AND (ff = '42'::bigint))
+         Index Cond: (ff = '42'::bigint)

was down to how generate_base_implied_equalities_const() marks the EC
as ec_broken = true without any regard to cleaning up the work it has
already partially completed.

Because the loop inside generate_base_implied_equalities_const() just
breaks as soon as we're unable to find a valid equality operator for
the two given types, with my version, since the EquivalenceMember's
order has effectively changed, we just discover the EC is broken
before we call process_implied_equality() ->
distribute_restrictinfo_to_rels(). In the code you've added, the
EquivalenceMembers are effectively still in the original order and the
process_implied_equality() -> distribute_restrictinfo_to_rels() gets
done before we discover the broken EC. The same qual is just added
again during generate_base_implied_equalities_broken(), which is why
the plan has a duplicate ff=42.

This is all just down to the order that the ECs are merged. If you'd
just swapped the order of the items in the query's WHERE clause to
become:

where ec1.ff = 42::int8 and ss1.x = ec1.f1 and ec1.ff = ec1.f1;

then my version would keep the duplicate qual. For what you've changed
the code to, the planner would not have produced the duplicate ff=42
qual if you'd written the WHERE clause as follows:

where ss1.x = ec1.f1 and ec1.ff = ec1.f1 and ec1.ff = 42::int8;

In short, I think the code I had for that was fine and it's just the
expected plan that you should be editing. If we wanted this
behaviour to be consistent then the fix should be to make
generate_base_implied_equalities_const() better at only distributing
the quals down to the relations after it has discovered that the EC is
not broken, or at least cleaning up the partial work that it's done if
it discovers a broken EC. The former seems better to me, but I doubt
that it matters too much as broken ECs should be pretty rare and it
does not seem worth spending too much effort making this work better.

I've not had a chance to look at the 0003 patch yet.

David

#13David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#12)
Re: [PoC] Reducing planning time when tables have many partitions

On Tue, 9 Aug 2022 at 19:10, David Rowley <dgrowleyml@gmail.com> wrote:

I've not had a chance to look at the 0003 patch yet.

I've looked at the 0003 patch now.

The performance numbers look quite impressive; however, there were a
few things about the patch where I struggled to figure out why they
were done the way you did:

+ root->eq_not_children_indexes = bms_add_member(root->eq_not_children_indexes,

Why is that in PlannerInfo rather than in the EquivalenceClass?

if (bms_equal(rel->relids, em->em_relids))
{
rel->eclass_member_indexes =
bms_add_member(rel->eclass_member_indexes, em_index);
}

Why are you only adding the eclass_member_index to the RelOptInfo when
the em_relids contain a singleton relation?

I ended up going and fixing the patch to be more how I imagined it.

I've ended up with 3 Bitmapset fields in EquivalenceClass;
ec_member_indexes, ec_nonchild_indexes, ec_norel_indexes. I also
trimmed the number of helper functions down for obtaining the minimal
set of matching EquivalenceMember indexes to just:

Bitmapset *
get_ecmember_indexes(PlannerInfo *root, EquivalenceClass *ec, Relids relids,
bool with_children, bool with_norel_members)

Bitmapset *
get_ecmember_indexes_strict(PlannerInfo *root, EquivalenceClass *ec,
Relids relids, bool with_children,
bool with_norel_members)

I'm not so much a fan of the bool parameters, but it seemed better
than having 8 different functions, one for each combination of the
bool parameters, instead of 2.

The "strict" version of the function takes the intersection of
eclass_member_indexes for each rel mentioned in relids, whereas the
non-strict version does a union of those. Each then intersects that
with all members in the 'ec', or just the non-child members when
'with_children' is false. They both then optionally bms_add_members()
the ec_norel_members if with_norel_members is true. I found it
difficult to figure out the best order to do the intersection. That
really depends on if the particular query has many EquivalenceClasses
with few EquivalenceMembers or few EquivalenceClasses with many
EquivalenceMembers. bms_int_members() always recycles the left input.
Ideally, that would always be the smallest Bitmapset. Maybe it's worth
inventing a new version of bms_int_members() that recycles the input
with the least nwords. That would give the subsequent
bms_next_member() calls an easier time. Right now they'll need to loop
over a bunch of 0 words at the end for many queries.

A few problems I ran into along the way:

1. generate_append_tlist() generates Vars with varno=0. That causes
problems when we add Exprs from those in add_eq_member() as there is
no element at root->simple_rel_array[0] to add eclass_member_indexes
to.
2. The existing comment for EquivalenceMember.em_relids claims "all
relids appearing in em_expr", but that's just not true when it comes
to em_is_child members.

So far, I fixed #1 by adding a hack to setup_simple_rel_arrays() to do
"root->simple_rel_array[0] = makeNode(RelOptInfo);" I'm not suggesting
that's the correct fix. It might be possible to set the varnos to the
varnos from the first Append child instead.

The fact that #2 is not true adds quite a bit of complexity to the
patch and I think the patch might even misbehave as a result. It seems
there are cases where a child em_relids can contain additional relids
that are not present in the em_expr. For example, when a UNION ALL
child has a Const in the targetlist, as explained in a comment in
add_child_rel_equivalences(). However, there also seem to be cases
where the opposite is true. I had to add the following code in
add_eq_member() to stop a regression test failing:

if (is_child)
expr_relids = bms_add_members(expr_relids, relids);

That's to make sure we add eclass_member_indexes to each RelOptInfo
mentioned in the em_expr.

After doing all that, I noticed that your benchmark was showing that
create_join_clause() was the new bottleneck. This was due to having to
loop so many times over the ec_sources to find an already built
RestrictInfo. I went off and added some new code to optimize the
lookup of those in a similar way by adding a new Bitmapset field in
RelOptInfo to index which ec_sources it mentioned, which meant having
to move ec_sources into PlannerInfo. I don't think this part of the
patch is quite right yet as the code I have relies on em_relids being
the same as the ones mentioned in the RestrictInfo. That seems not
true for em_is_child EMs, so I think we probably need to add a new
field to EquivalenceMember that truly is just pull_varnos from
em_expr, or else look into some way to make em_relids mean that (like
the comment claims).

Here are my results from running your benchmark on master (@f6c750d31)
with and without the attached patch.

 npart | master (ms) | patched (ms) | speedup
-------+-------------+--------------+----------
     2 |        0.28 |         0.29 |   95.92%
     4 |        0.37 |         0.38 |   96.75%
     8 |        0.53 |         0.56 |   94.43%
    16 |        0.92 |         0.91 |  100.36%
    32 |        1.82 |         1.70 |  107.57%
    64 |        4.05 |         3.26 |  124.32%
   128 |       10.83 |         6.69 |  161.89%
   256 |       42.63 |        19.46 |  219.12%
   512 |      194.31 |        42.60 |  456.14%
  1024 |     1104.02 |        98.37 | 1122.33%

This resulted in some good additional gains in planner performance.
The 1024 partition case is now about 11x faster on my machine instead
of 4x. The 2 partition case does regress slightly. There might be a few
things we can do about that, for example, move ec_collation up 1 to
shrink EquivalenceClass back down closer to the size it was before [1].

I've attached a draft patch with my revisions.

David

[1]: https://commitfest.postgresql.org/39/3810/

Attachments:

eclass_member_speedup_v3.patchtext/plain; charset=US-ASCII; name=eclass_member_speedup_v3.patchDownload+877-284
#14Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#13)
Re: [PoC] Reducing planning time when tables have many partitions

Dear David,

I really appreciate your reply and your modifying the patch. The
performance improvements are quite impressive. I believe these
improvements will help PostgreSQL users. Thank you again.

The 2 partition does regress slightly. There might be a few
things we can do about that

I tried to solve this regression problem. From here, I will refer to
the patch you sent on August 16th as the v3 patch. I will also call my
patch attached to this email the v4 patch. I will discuss the v4 patch
later.

Additionally, I give names to queries.
* Query A: The query we have been using in previous emails, which
joins students, scores, and gpas tables.
* Query B: The query which is attached to this email.

Query B is as follows:

===
SELECT *
FROM testtable_1, testtable_2, testtable_3, testtable_4, testtable_5,
testtable_6, testtable_7, testtable_8
WHERE testtable_1.x = testtable_2.x AND testtable_1.x = testtable_3.x
AND testtable_1.x = testtable_4.x AND testtable_1.x = testtable_5.x
AND testtable_1.x = testtable_6.x AND testtable_1.x = testtable_7.x
AND testtable_1.x = testtable_8.x;
===

Query A joins three tables, whereas Query B joins eight tables. Since
EquivalenceClass is used when handling chained join conditions, I
thought queries joining many tables, such as Query B, would have
greater performance impacts.

I have investigated the v3 patch with these queries. As a result, I
did not observe any regressions in Query A in my environment. However,
the v3 patch showed significant degradation in Query B.

The following table and Figures 1 and 2 describe the result. The v3
patch resulted in a regression of 8.7% for one partition and 4.8% for
two partitions. Figure 2 shows the distribution of planning times for
the 1-partition case, indicating that the 8.7% regression is not an
error.

Table 1: Planning time of Query B
(n: number of partitions)
(milliseconds)
----------------------------------------------------------------
n | Master | v3 | v4 | Master / v3 | Master / v4
----------------------------------------------------------------
1 | 54.926 | 60.178 | 55.275 | 91.3% | 99.4%
2 | 53.853 | 56.554 | 53.519 | 95.2% | 100.6%
4 | 57.115 | 57.829 | 55.648 | 98.8% | 102.6%
8 | 64.208 | 60.945 | 58.025 | 105.4% | 110.7%
16 | 79.818 | 65.526 | 63.365 | 121.8% | 126.0%
32 | 136.981 | 77.813 | 76.526 | 176.0% | 179.0%
64 | 371.991 | 108.058 | 110.202 | 344.2% | 337.6%
128 | 1449.063 | 173.326 | 181.302 | 836.0% | 799.3%
256 | 6245.577 | 333.480 | 354.961 | 1872.8% | 1759.5%
----------------------------------------------------------------

This performance degradation is due to the heavy processing of the
get_ec***_indexes***() functions. These functions are the core part of
the optimization we are working on in this thread, but they are
relatively heavy when the number of partitions is small.

I noticed that these functions were called repeatedly with the same
arguments. During planning Query B with one partition, the
get_ec_source_indexes_strict() function was called 2087 times with
exactly the same parameters. Such repeated calls occurred many times
in a single query.

To address this problem, I introduced a caching mechanism in the v4
patch. This patch caches the Bitmapset once it has been computed.
After that, we only have to read the cached value instead of
performing the same process. Of course, we cannot devote much time to
the caching itself. Hash tables would be a simple way to accomplish
this, but they are too expensive in the current case, where even
microsecond-level overhead matters. Therefore, my patch adopts
another approach. I will use the following function as an example to
explain it.

===
Bitmapset *get_ecmember_indexes(PlannerInfo *root,
EquivalenceClass *ec, Relids relids, bool with_children, bool
with_norel_members);
===

My idea is "caching the returned Bitmapset into Relids." If the Relids
has the result Bitmapset, we can access it quickly via the pointer. Of
course, I understand this description is not accurate. Relids is just
an alias of Bitmapset, so we cannot change the layout.

I will describe the precise mechanism. In the v4 patch, I changed the
signature of the get_ecmember_indexes() function as follows.

===
Bitmapset *get_ecmember_indexes(PlannerInfo *root,
EquivalenceClass *ec, Relids relids, bool with_children, bool
with_norel_members, ECIndexCache *cache);
===

ECIndexCache is storage for caching returned values. ECIndexCache has
a one-to-one relationship with Relids. This relationship is achieved
by placing the ECIndexCache just alongside the Relids. For example,
ECIndexCache corresponding to some RelOptInfo's relids exists in the
same RelOptInfo. When calling the get_ecmember_indexes() function with
a RelOptInfo, we pass RelOptInfo->ECIndexCache together. On the other
hand, since Relids appear in various places, it is sometimes difficult
to prepare a corresponding ECIndexCache. In such cases, we give up
caching and pass NULL.

Besides, one ECIndexCache can only map to one EquivalenceClass.
ECIndexCache only caches for the first EquivalenceClass it encounters
and does not cache for another EC.

My method abandons full caching to prevent overhead. However, it
overcame the regression problem for Query B. As can be seen from
Figure 2, the regression with the v4 patch is either non-existent or
negligible. Furthermore, the v4 patch is faster than the v3 patch when
the number of partitions is 32 or less.

In addition to Query B, the results with Query A are shown in Figure
3. I cannot recognize any regression from Figure 3. Please note
that these results were obtained on my machine and may differ in other
environments.

However, when the number of partitions was relatively large, my patch
was slightly slower than the v3 patch. This may be due to too frequent
memory allocation. ECIndexCache is a large struct containing 13
pointers. In the current implementation, ECIndexCache exists within
commonly used structs such as RelOptInfo. Therefore, ECIndexCache is
allocated even if no one uses it. When there were 256 partitions of
Query B, 88509 ECIndexCache instances were allocated, but only 2295
were actually used. This means that 95.4% were wasted. I think
on-demand allocation would solve this problem. Similar problems could
also occur with other workloads, including OLTP. I'm going to try this
approach soon.

I really apologize for not commenting on the rest of your reply. I
will continue to consider them.

--
Best regards,
Yuya Watari

Attachments:

v4-0001-Apply-eclass_member_speedup_v3.patch.patchapplication/octet-stream; name=v4-0001-Apply-eclass_member_speedup_v3.patch.patchDownload+877-285
v4-0002-Implement-ECIndexCache.patchapplication/octet-stream; name=v4-0002-Implement-ECIndexCache.patchDownload+243-45
figure-1.pngimage/png; name=figure-1.pngDownload+0-3
figure-2.pngimage/png; name=figure-2.pngDownload+8-1
figure-3.pngimage/png; name=figure-3.pngDownload+4-0
query-b-create-tables.sqlapplication/octet-stream; name=query-b-create-tables.sqlDownload
query-b-query.sqlapplication/octet-stream; name=query-b-query.sqlDownload
#15David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#14)
Re: [PoC] Reducing planning time when tables have many partitions

On Fri, 26 Aug 2022 at 12:40, Yuya Watari <watari.yuya@gmail.com> wrote:

This performance degradation is due to the heavy processing of the
get_ec***_indexes***() functions. These functions are the core part of
the optimization we are working on in this thread, but they are
relatively heavy when the number of partitions is small.

I noticed that these functions were called repeatedly with the same
arguments. During planning Query B with one partition, the
get_ec_source_indexes_strict() function was called 2087 times with
exactly the same parameters. Such repeated calls occurred many times
in a single query.

Instead of doing the caching like this, why don't we code up some
iterators that we can loop over to fetch the required EMs?

I'll attempt to type out my thoughts here without actually trying to
see if this works:

typedef struct EquivalenceMemberIterator
{
EquivalenceClass *ec;
Relids relids;
Bitmapset *em_matches;
int position; /* last found index of em_matches or -1 */
bool use_index;
bool with_children;
bool with_norel_members;
} EquivalenceMemberIterator;

We'd then have functions like:

static void
get_ecmember_indexes_iterator(EquivalenceMemberIterator *it,
PlannerInfo *root, EquivalenceClass *ec, Relids relids, bool
with_children, bool with_norel_members)
{
it->ec = ec;
it->relids = relids;
it->position = -1;

it->use_index = (root->simple_rel_array_size > 32); /* or whatever
threshold is best */
it->with_children = with_children;
it->with_norel_members = with_norel_members;

if (it->use_index)
it->em_matches = get_ecmember_indexes(root, ec, relids,
with_children, with_norel_members);
else
it->em_matches = NULL;
}

static EquivalenceMember *
get_next_matching_member(PlannerInfo *root, EquivalenceMemberIterator *it)
{
if (it->use_index)
{
it->position = bms_next_member(it->em_matches, it->position);
if (it->position >= 0)
return list_nth(root->eq_members, it->position);
return NULL;
}
else
{
int i = it->position;
while ((i = bms_next_member(it->ec->ec_member_indexes, i)) >= 0)
{
/* filter out the EMs we don't want here "break" when
we find a match */
}
it->position = i;
if (i >= 0)
return list_nth(root->eq_members, i);
return NULL;
}
}

Then the consuming code will do something like:

EquivalenceMemberIterator it;
get_ecmember_indexes_iterator(&it, root, ec, relids, true, false);

while ((cur_em = get_next_matching_member(root, &it)) != NULL)
{
// do stuff
}

David

#16Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#15)
Re: [PoC] Reducing planning time when tables have many partitions

Dear David,

On Fri, Aug 26, 2022 at 12:18 PM David Rowley <dgrowleyml@gmail.com> wrote:

Instead of doing the caching like this, why don't we code up some
iterators that we can loop over to fetch the required EMs?

Thank you very much for your quick reply and for sharing your idea
with code. I also think introducing EquivalenceMemberIterator is one
good alternative solution. I will try to implement and test it.

Thank you again for helping me.

--
Best regards,
Yuya Watari

#17Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#16)
Re: [PoC] Reducing planning time when tables have many partitions

Hello,

On Fri, Aug 26, 2022 at 5:53 PM Yuya Watari <watari.yuya@gmail.com> wrote:

Thank you very much for your quick reply and for sharing your idea
with code. I also think introducing EquivalenceMemberIterator is one
good alternative solution. I will try to implement and test it.

I apologize for my late response. I have implemented several
approaches and tested them.

1. Changes

I will describe how I modified our codes. I tested five versions:

* v1: The first draft patch by David with bug fixes by me. This patch
does not perform any optimizations based on Bitmapset operations.
* v3: The past patch
* v5 (v3 with revert): The v3 with revert of one of our optimizations
* v6 (Iterator): An approach using iterators to enumerate over
EquivalenceMembers. This approach is David's suggestion in the
previous email.
* v7 (Cache): My approach to caching the result of get_ec***indexes***()

Please note that there is no direct parent-child relationship
between v6 and v7; they are both v5's children, i.e., siblings. I'm
sorry for the confusing versioning.

1.1. Revert one of our optimizations (v5)

As I mentioned in the comment in
v[5|6|7]-0002-Revert-one-of-the-optimizations.patch, I reverted one of
our optimizations. This code tries to find EquivalenceMembers that do
not satisfy the bms_overlap condition. We encounter such members early
in the loop, so a linear search is enough, and our optimization is
excessive here. As a result of experiments, I found this
optimization was a bottleneck, so I reverted it.

v6 (Iterator) and v7 (Cache) include this revert.

1.2. Iterator (v6)

I have implemented the iterator approach. The code is based on what
David advised, but I customized it a bit. I added the "bool
caller_needs_recheck" argument to get_ecmember_indexes_iterator() and
other similar functions. If this argument is true, the iterator
enumerates all EquivalenceMembers without checking conditions such as
bms_is_subset or bms_overlap.

This change is because callers of these iterators sometimes recheck
desired conditions after calling it. For example, if some caller wants
EquivalenceMembers whose Relids is equal to some value, it calls
get_ecmember_indexes(). However, since the result may have false
positives, the caller has to recheck the result by the bms_equal()
condition. In this case, if we are below the threshold and don't
perform our optimization, checking bms_overlap() in the iterator does
not make sense. We can solve this problem by passing true to the
"caller_needs_recheck" argument to skip redundant checking.

1.3. Cache (v7)

I have improved my caching approach. First, I introduced the on-demand
allocation approach I mentioned in the previous email. ECIndexCache is
allocated not together with RelOptInfo but when using it.

In addition to this, a new version of the patch can handle multiple
EquivalenceClasses. In the previous version, caching was only possible
for one EquivalenceClass. This limitation is to prevent overhead but
reduces caching opportunities. So, I have improved it so that it can
handle all EquivalenceClasses. I made this change on the advice of
Fujita-san. Thank you, Fujita-san.

2. Experimental Results

I conducted experiments to test these methods.

2.1. Query A

Figure 1 illustrates the planning times of Query A. Please see the
previous email for what Query A refers to. The performance of all
methods except master and v1 are almost the same. I cannot observe any
degradation from this figure.

2.2. Query B

Query B joins eight tables. In the previous email, I mentioned that
the v3 patch has significant degradation for this query.

Figure 2 and Table 1 show the results. The three approaches of v5, v6
(Iterator), and v7 (Cache) showed good overall performance. In
particular, v7 (Cache) performed best for the smaller number of
partitions.

Table 1: Planning Time of Query B (ms)
-------------------------------------
n | Master | v1 | v3
-------------------------------------
1 | 55.459 | 57.376 | 58.849
2 | 54.162 | 56.454 | 57.615
4 | 56.491 | 59.742 | 57.108
8 | 62.694 | 67.920 | 59.591
16 | 79.547 | 90.589 | 64.954
32 | 134.623 | 160.452 | 76.626
64 | 368.716 | 439.894 | 107.278
128 | 1374.000 | 1598.748 | 170.909
256 | 5955.762 | 6921.668 | 324.113
-------------------------------------
--------------------------------------------------------
n | v5 (v3 with revert) | v6 (Iterator) | v7 (Cache)
--------------------------------------------------------
1 | 56.268 | 57.520 | 56.703
2 | 55.511 | 55.212 | 54.395
4 | 55.643 | 55.025 | 54.996
8 | 57.770 | 57.519 | 57.114
16 | 63.075 | 63.117 | 63.161
32 | 74.788 | 74.369 | 75.801
64 | 104.027 | 104.787 | 105.450
128 | 169.473 | 169.019 | 174.919
256 | 321.450 | 322.739 | 342.601
--------------------------------------------------------

2.3. Join Order Benchmark

It is essential to test real workloads, so I used the Join Order
Benchmark [1]. This benchmark contains many complicated queries
joining a lot of tables. I partitioned fact tables by 'id' columns and
measured query planning times.

Figure 3 and Table 2 describe the results. The results showed that all
methods produced some degradations when there were not so many
partitions. However, the degradation of v7 (cache) was relatively
small. It was 0.8% with two partitions, while the other methods'
degradation was at least 1.6%.

Table 2: Speedup of Join Order Benchmark (higher is better)
-----------------------------------------------------------------
n | v3 | v5 (v3 with revert) | v6 (Iterator) | v7 (Cache)
-----------------------------------------------------------------
2 | 95.8% | 97.3% | 97.3% | 97.7%
4 | 96.9% | 98.4% | 98.0% | 99.2%
8 | 102.2% | 102.9% | 98.1% | 103.0%
16 | 107.6% | 109.5% | 110.1% | 109.4%
32 | 123.5% | 125.4% | 125.5% | 125.0%
64 | 165.2% | 165.9% | 164.6% | 165.9%
128 | 308.2% | 309.2% | 312.1% | 311.4%
256 | 770.1% | 772.3% | 776.6% | 773.2%
-----------------------------------------------------------------

2.4. pgbench

Our optimizations must not cause negative impacts on OLTP workloads. I
conducted pgbench, and Figure 4 and Table 3 show its result.

Table 3: The result of pgbench (tps)
------------------------------------------------------------------------
n | Master | v3 | v5 (v3 with revert) | v6 (Iterator) | v7 (Cache)
------------------------------------------------------------------------
1 | 7617 | 7510 | 7484 | 7599 | 7561
2 | 7613 | 7487 | 7503 | 7609 | 7560
4 | 7559 | 7497 | 7453 | 7560 | 7553
8 | 7506 | 7429 | 7405 | 7523 | 7503
16 | 7584 | 7481 | 7466 | 7558 | 7508
32 | 7556 | 7456 | 7448 | 7558 | 7521
64 | 7555 | 7452 | 7435 | 7541 | 7504
128 | 7542 | 7430 | 7442 | 7558 | 7517
------------------------------------------------------------------------
Avg | 7566 | 7468 | 7455 | 7563 | 7528
------------------------------------------------------------------------

This result indicates that v3 and v5 (v3 with revert) had a
significant negative impact on the pgbench workload. Their tps
decreased by 1.3% or more. On the other hand, degradations of v6
(Iterator) and v7 (Cache) are non-existent or negligible.

3. Causes of Degression

We could not avoid degradation with the Join Order Benchmark. The
leading cause of this problem is that Bitmapset operation, especially
bms_next_member(), is relatively slower than simple enumeration over
List.

It is easy to imagine that bms_next_member(), which has complex bit
operations, is a little heavier than List enumeration, which simply
advances a pointer. The fact that even v1, where we don't perform
any optimizations, slowed down supports this notion.

I think preventing this regression is very hard. To do so, we must
have both List and Bitmapset representations of EquivalenceMembers.
However, I don't prefer this solution because it is redundant and
leads to less code maintainability.

Reducing Bitmapset->nwords is another possible solution. I will try
it, but it will likely not solve the significant degradation in
pgbench for v3 and v5. This is because such degradation did not occur
with v6 and v7, which also use Bitmapset.

4. Which Method is The Best?

First of all, it is hard to adopt v3 and v5 (v3 with revert) because
they degrade performance on OLTP workloads. Therefore, v6 (Iterator)
and v7 (Cache) are possible candidates. Of these methods, I prefer v7
(Cache).

Actually, I don't think introducing thresholds is a good idea because
the best threshold is unclear. If we become conservative to avoid
degradation, we must increase the threshold, but that takes away the
opportunity for optimization. Lowering the threshold has the opposite
trade-off: more optimization opportunities, but a higher risk of
regressions.

In contrast, v7 (Cache) is an essential solution in terms of reducing
the cost of repeated function calls and does not require the
introduction of a threshold. Besides, it performs better on almost all
workloads, including the Join Order Benchmark. It also has no negative
impacts on OLTP.

In conclusion, I think v7 (Cache) is the most desirable. Of course,
the method may have some problems, but it is worth considering.

[1]: https://github.com/winkyao/join-order-benchmark

--
Best regards,
Yuya Watari

Attachments:

figure-1.pngimage/png; name=figure-1.pngDownload+2-1
figure-2.pngimage/png; name=figure-2.pngDownload+3-2
figure-3.pngimage/png; name=figure-3.pngDownload
figure-4.pngimage/png; name=figure-4.pngDownload
v5-0001-Apply-eclass_member_speedup_v3.patch.patchapplication/octet-stream; name=v5-0001-Apply-eclass_member_speedup_v3.patch.patchDownload+877-285
v5-0002-Revert-one-of-the-optimizations.patchapplication/octet-stream; name=v5-0002-Revert-one-of-the-optimizations.patchDownload+12-10
v6-0001-Apply-eclass_member_speedup_v3.patch.patchapplication/octet-stream; name=v6-0001-Apply-eclass_member_speedup_v3.patch.patchDownload+877-285
v6-0002-Revert-one-of-the-optimizations.patchapplication/octet-stream; name=v6-0002-Revert-one-of-the-optimizations.patchDownload+12-10
v6-0003-Implement-iterators.patchapplication/octet-stream; name=v6-0003-Implement-iterators.patchDownload+537-88
v7-0001-Apply-eclass_member_speedup_v3.patch.patchapplication/octet-stream; name=v7-0001-Apply-eclass_member_speedup_v3.patch.patchDownload+877-285
v7-0002-Revert-one-of-the-optimizations.patchapplication/octet-stream; name=v7-0002-Revert-one-of-the-optimizations.patchDownload+12-10
v7-0003-Implement-ECIndexCache.patchapplication/octet-stream; name=v7-0003-Implement-ECIndexCache.patchDownload+255-44
#18Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#17)
Re: [PoC] Reducing planning time when tables have many partitions

Hello,

On Wed, Sep 21, 2022 at 6:43 PM Yuya Watari <watari.yuya@gmail.com> wrote:

1.1. Revert one of our optimizations (v5)

As I mentioned in the comment in
v[5|6|7]-0002-Revert-one-of-the-optimizations.patch, I reverted one of
our optimizations. This code tries to find EquivalenceMembers that do
not satisfy the bms_overlap condition. We encounter such members early
in the loop, so a linear search is enough, and our optimization is
excessive here. As a result of experiments, I found this
optimization was a bottleneck, so I reverted it.

In the previous mail, I proposed a revert of one excessive
optimization. In addition, I found a new bottleneck and attached a new
version of the patch solving it to this email.

The new bottleneck exists in the select_outer_pathkeys_for_merge()
function. At the end of this function, we count EquivalenceMembers
that satisfy the specific condition. To count them, we have used
Bitmapset operations. Through experiments, I concluded that this
optimization is effective for larger cases but leads to some
degradation for the smaller number of partitions. The new patch
switches two algorithms depending on the problem sizes.

1. Experimental result

1.1. Join Order Benchmark

As in the previous email, I used the Join Order Benchmark to evaluate
the patches' performance. The correspondence between each version and
patches is as follows.

v3: v8-0001-*.patch
v5 (v3 with revert): v8-0001-*.patch + v8-0002-*.patch
v8 (v5 with revert): v8-0001-*.patch + v8-0002-*.patch + v8-0003-*.patch

I show the speed-up of each method compared with the master branch in
Table 1. When the number of partitions is 1, performance degradation
is kept to 1.1% in v8, while they are 4.2% and 1.8% in v3 and v5. This
result indicates that a newly introduced revert is effective.

Table 1: Speedup of Join Order Benchmark (higher is better)
(n = the number of partitions)
----------------------------------------------------------
n | v3 | v5 (v3 with revert) | v8 (v5 with revert)
----------------------------------------------------------
2 | 95.8% | 98.2% | 98.9%
4 | 97.2% | 99.7% | 99.3%
8 | 101.4% | 102.5% | 103.4%
16 | 108.7% | 111.4% | 110.2%
32 | 127.1% | 127.6% | 128.8%
64 | 169.5% | 172.1% | 172.4%
128 | 330.1% | 335.2% | 332.3%
256 | 815.1% | 826.4% | 821.8%
----------------------------------------------------------

1.2. pgbench

The following table describes the result of pgbench. The v5 and v8
performed clearly better than the v3 patch. The difference between v5
and v8 is not so significant, but v8's performance is close to the
master branch.

Table 2: The result of pgbench (tps)
-----------------------------------------------------------------
n | Master | v3 | v5 (v3 with revert) | v8 (v5 with revert)
-----------------------------------------------------------------
1 | 7550 | 7422 | 7474 | 7521
2 | 7594 | 7381 | 7536 | 7529
4 | 7518 | 7362 | 7461 | 7524
8 | 7459 | 7340 | 7424 | 7460
-----------------------------------------------------------------
Avg | 7531 | 7377 | 7474 | 7509
-----------------------------------------------------------------

2. Conclusion and future works

The revert in the v8-0003-*.patch is effective in preventing
performance degradation for the smaller number of partitions. However,
I don't think what I have done in the patch is the best or ideal
solution. As I mentioned in the comments in the patch, switching two
algorithms may be ugly because it introduces code duplication. We need
a wiser solution to this problem.

--
Best regards,
Yuya Watari

Attachments:

v8-0001-Apply-eclass_member_speedup_v3.patch.patchapplication/octet-stream; name=v8-0001-Apply-eclass_member_speedup_v3.patch.patchDownload+874-284
v8-0002-Revert-one-of-the-optimizations.patchapplication/octet-stream; name=v8-0002-Revert-one-of-the-optimizations.patchDownload+12-10
v8-0003-Use-conventional-algorithm-for-smaller-size-cases.patchapplication/octet-stream; name=v8-0003-Use-conventional-algorithm-for-smaller-size-cases.patchDownload+49-10
#19Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#18)
Re: [PoC] Reducing planning time when tables have many partitions

Hello,

I noticed that the previous patch does not apply to the current HEAD.
I attached the rebased version to this email.

--
Best regards,
Yuya Watari

Attachments:

v8-0001-Apply-eclass_member_speedup_v3.patch.patch (application/octet-stream, +876 -286)
v8-0002-Revert-one-of-the-optimizations.patch (application/octet-stream, +12 -10)
v8-0003-Use-conventional-algorithm-for-smaller-size-cases.patch (application/octet-stream, +49 -10)
#20Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#19)
Re: [PoC] Reducing planning time when tables have many partitions

On 2/11/2022 15:27, Yuya Watari wrote:

Hello,

I noticed that the previous patch does not apply to the current HEAD.
I attached the rebased version to this email.

I'm still reviewing your patch. For the most part it seems OK, but do
you really need both the eq_sources and eq_derives lists now? As far
as I can see, every access to these lists goes through
eclass_source_indexes and eclass_derive_indexes, respectively. Maybe
merge them?
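The merge idea could be sketched like this. This is illustrative
Python, not PostgreSQL internals: the class and attribute names are
invented, and plain sets stand in for the Bitmapset position indexes
the patch actually uses.

```python
class ClauseStore:
    """One merged clause list, with two position-set 'indexes' that
    keep source and derived clauses distinguishable."""

    def __init__(self):
        self.clauses = []        # merged list (was eq_sources + eq_derives)
        self.source_idx = set()  # positions of source clauses
        self.derive_idx = set()  # positions of derived clauses

    def add(self, clause, derived=False):
        pos = len(self.clauses)
        self.clauses.append(clause)
        # Membership in one index set replaces membership in one of
        # the two separate lists.
        (self.derive_idx if derived else self.source_idx).add(pos)
        return pos

    def sources(self):
        return [self.clauses[p] for p in sorted(self.source_idx)]

    def derives(self):
        return [self.clauses[p] for p in sorted(self.derive_idx)]
```

Since all accesses already go through the indexes, the two lists
collapse into one without losing the source/derived distinction.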

--
regards,
Andrey Lepikhov
Postgres Professional

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#20)
#22Zhang Mingli
zmlpostgres@gmail.com
In reply to: Tom Lane (#21)
#23Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#19)
#24Thom Brown
thom@linux.com
In reply to: Zhang Mingli (#22)
#25Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Thom Brown (#24)
#26Thom Brown
thom@linux.com
In reply to: Alvaro Herrera (#25)
#27Thom Brown
thom@linux.com
In reply to: Thom Brown (#26)
#28Yuya Watari
watari.yuya@gmail.com
In reply to: Thom Brown (#27)
#29David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#28)
#30Thom Brown
thom@linux.com
In reply to: David Rowley (#29)
#31David Rowley
dgrowleyml@gmail.com
In reply to: Thom Brown (#30)
#32Thom Brown
thom@linux.com
In reply to: David Rowley (#31)
#33Yuya Watari
watari.yuya@gmail.com
In reply to: Thom Brown (#32)
#34David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#33)
#35Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#34)
#36Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#35)
#37David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#36)
#38Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#37)
#39Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#38)
#40Yuya Watari
watari.yuya@gmail.com
In reply to: Andrei Lepikhov (#39)
#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Yuya Watari (#40)
#42David Rowley
dgrowleyml@gmail.com
In reply to: Alvaro Herrera (#41)
#43Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#42)
#44Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#43)
#45Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#44)
#46Andrei Lepikhov
lepihov@gmail.com
In reply to: Andrei Lepikhov (#45)
#47Yuya Watari
watari.yuya@gmail.com
In reply to: Andrei Lepikhov (#46)
#48Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Andrei Lepikhov (#46)
#49Yuya Watari
watari.yuya@gmail.com
In reply to: Ashutosh Bapat (#48)
#50Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#49)
#51Yuya Watari
watari.yuya@gmail.com
In reply to: Andrei Lepikhov (#50)
#52Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Yuya Watari (#49)
#53Yuya Watari
watari.yuya@gmail.com
In reply to: Ashutosh Bapat (#52)
#54Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#53)
#55Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Andrei Lepikhov (#54)
#56Andrei Lepikhov
lepihov@gmail.com
In reply to: Ashutosh Bapat (#55)
#57Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Andrei Lepikhov (#56)
#58Yuya Watari
watari.yuya@gmail.com
In reply to: Ashutosh Bapat (#57)
#59David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#44)
#60David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#59)
#61David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#58)
#62Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#61)
#63Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#62)
#64Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Yuya Watari (#63)
#65Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#63)
#66Yuya Watari
watari.yuya@gmail.com
In reply to: Andrei Lepikhov (#65)
#67Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Yuya Watari (#66)
#68Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#66)
#69Yuya Watari
watari.yuya@gmail.com
In reply to: Andrei Lepikhov (#68)
#70Alena Rybakina
lena.ribackina@yandex.ru
In reply to: Yuya Watari (#69)
#71John Naylor
john.naylor@enterprisedb.com
In reply to: Alena Rybakina (#70)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: John Naylor (#71)
#73Alena Rybakina
lena.ribackina@yandex.ru
In reply to: Tom Lane (#72)
#74Andrei Lepikhov
lepihov@gmail.com
In reply to: Yuya Watari (#69)
#75Yuya Watari
watari.yuya@gmail.com
In reply to: Andrei Lepikhov (#74)
#76Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#75)
#77Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#76)
#78Alena Rybakina
lena.ribackina@yandex.ru
In reply to: Yuya Watari (#77)
#79Yuya Watari
watari.yuya@gmail.com
In reply to: Alena Rybakina (#78)
#80Alena Rybakina
lena.ribackina@yandex.ru
In reply to: Yuya Watari (#79)
#81Yuya Watari
watari.yuya@gmail.com
In reply to: Alena Rybakina (#80)
#82Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Yuya Watari (#81)
#83Yuya Watari
watari.yuya@gmail.com
In reply to: Ashutosh Bapat (#82)
#84jian he
jian.universality@gmail.com
In reply to: Yuya Watari (#83)
#85Yuya Watari
watari.yuya@gmail.com
In reply to: jian he (#84)
#86Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#85)
#87Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#86)
#88Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#87)
#89Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Yuya Watari (#88)
#90Yuya Watari
watari.yuya@gmail.com
In reply to: Dmitry Dolgov (#89)
#91Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Yuya Watari (#90)
#92Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Ashutosh Bapat (#91)
#93Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Alvaro Herrera (#92)
#94Yuya Watari
watari.yuya@gmail.com
In reply to: Ashutosh Bapat (#93)
#95Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Yuya Watari (#94)
#96Yuya Watari
watari.yuya@gmail.com
In reply to: Alvaro Herrera (#95)
#97Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Yuya Watari (#96)
#98Yuya Watari
watari.yuya@gmail.com
In reply to: Alvaro Herrera (#97)
#99Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#98)
#100Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#99)
#101Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#100)
#102Yuya Watari
watari.yuya@gmail.com
In reply to: Yuya Watari (#101)
#103newtglobal postgresql_contributors
postgresql_contributors@newtglobalcorp.com
In reply to: Yuya Watari (#102)
#104David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#102)
#105Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#104)
#106David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#105)
#107Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#106)
#108David Rowley
dgrowleyml@gmail.com
In reply to: Yuya Watari (#107)
#109Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#108)
#110Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: David Rowley (#108)
#111Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#110)
#112Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#108)
#113Yuya Watari
watari.yuya@gmail.com
In reply to: Tom Lane (#109)
#114Yuya Watari
watari.yuya@gmail.com
In reply to: Ashutosh Bapat (#110)
#115David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#109)
#116David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#115)
#117Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#116)
#118Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#116)
#119Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: David Rowley (#116)
#120David Rowley
dgrowleyml@gmail.com
In reply to: Ashutosh Bapat (#119)
#121David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#118)
#122Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#121)
#123David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#122)
#124David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#123)
#125Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#124)
#126Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#124)
#127Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: David Rowley (#124)
#128David Rowley
dgrowleyml@gmail.com
In reply to: Ashutosh Bapat (#127)
#129Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#128)
#130Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: David Rowley (#128)
#131David Rowley
dgrowleyml@gmail.com
In reply to: Ashutosh Bapat (#130)
#132Yuya Watari
watari.yuya@gmail.com
In reply to: David Rowley (#131)
#133Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#131)
#134David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#133)
#135Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#134)
#136Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#135)
#137David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#135)
#138Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: David Rowley (#137)
#139Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#137)
#140David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#139)
#141Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#140)
#142David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#141)
#143Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#142)