Performance issue in foreign-key-aware join estimation
In connection with David Rowley's proposal to change bitmapset.c to use
64-bit words, I dug out an old test case I had for a complex-to-plan query
(attached). Andres Freund posted this to the lists perhaps ten years ago,
though I can't locate the original posting right now.
I was distressed to discover via perf that 69% of the runtime of this
test now goes into match_eclasses_to_foreign_key_col(). That seems
clearly unacceptable. There aren't an unreasonable number of foreign key
constraints in the underlying schema --- mostly one per table, and there's
only one table with as many as 3. However, poking around in the planner
data structures, it turns out there are:
888 base relations
1005 EquivalenceClasses
167815 fkey_list entries initially
690 fkey_list entries after match_foreign_keys_to_quals trims them
So the reason match_eclasses_to_foreign_key_col is so dominant in the
runtime is that it's invoked 167815 times and has to scan a thousand
EquivalenceClasses (unsuccessfully) on most of those calls.
How did the fkey_list get that big? I think the issue is that the
query touches the same tables many many times (888 baserels, but
there are only 20 distinct tables in the schema) and we get an
O(N^2) growth in the apparent number of FKs.
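To see why that blows up quadratically: if each underlying table is
referenced k times by the query, every FK between two tables shows up
once per pair of their baserel copies. A toy model of that growth (the
constants below are purely illustrative, not derived from the real
schema):

```c
#include <assert.h>

/*
 * Toy model of the fkey_list blow-up.  If the query references each
 * underlying table "copies" times, every FK between two tables appears
 * once per (referencing copy, referenced copy) pair, so the apparent
 * FK count grows as copies^2 per underlying constraint.
 */
static long
apparent_fkeys(long n_constraints, long copies_per_table)
{
	return n_constraints * copies_per_table * copies_per_table;
}
```

With a couple of dozen constraints and tables referenced dozens of
times each, six-figure fkey_list sizes fall out naturally.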
Clearly, we ought to rethink that data structure. I'm not sure
offhand how to make it better, but this is pretty awful.
Perhaps there'd also be some use in having better indexing for
the EquivalenceClass list, but again I'm not sure what that'd
look like.
regards, tom lane
On Fri, 21 Dec 2018 at 06:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I was distressed to discover via perf that 69% of the runtime of this
test now goes into match_eclasses_to_foreign_key_col(). That seems
clearly unacceptable.
Agreed. That's pretty terrible.
I looked at this a bit and see that
match_eclasses_to_foreign_key_col() is missing any smarts to skip
equivalence classes that don't have ec_relids bits for both rels. With
that added the run-time is reduced pretty dramatically.
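The check amounts to a cheap superset test on ec_relids before scanning
any members. A single-word sketch of that early exit (PostgreSQL's real
Bitmapset is a variable-length word array; one 64-bit word is enough to
show the idea):

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit word standing in for a Bitmapset of relids. */
typedef uint64_t relids_t;

/*
 * Skip the EquivalenceClass unless ec_relids contains both the FK's
 * referencing rel and its referenced rel; no member pair can match
 * otherwise.
 */
static int
ec_can_match_fkey(relids_t ec_relids, int con_relid, int ref_relid)
{
	relids_t	both = ((relids_t) 1 << con_relid) |
		((relids_t) 1 << ref_relid);

	return (ec_relids & both) == both;
}
```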
I've only tested with a debug build as of now, but I get:
Unpatched:
$ pgbench -n -T 60 -f query.sql postgres
latency average = 18411.604 ms
Patched:
latency average = 8748.177 ms
Going by my profiler this drops match_eclasses_to_foreign_key_col()
down to just 10% of total planner time for this query. The new
bms_is_member() call is pretty hot inside that function though.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
speed_up_matching_fkeys_to_eclasses.patch (application/octet-stream, +5 -0)
On Sat, 22 Dec 2018 at 09:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
Going by my profiler this drops match_eclasses_to_foreign_key_col()
down to just 10% of total planner time for this query. The new
bms_is_member() call is pretty hot inside that function though.
I should have said 28% instead of 10%. 10% is the time spent
exclusively just in that function (not in a function called by that
function). The bms_is_member() call I mention above is about 20% of
the total time, which likely includes some other call sites too.
Back in [1], I mentioned that I'd like to start moving away from our
linked list based implementation of List and start using an array
based version instead. If we had this we could easily further improve
this code fkey matching code to not even look near equivalence classes
that don't contain members for the relations in question with a design
something like:
1. Make PlannerInfo->eq_classes an array list instead of a linked
list, which will significantly improve the performance of list_nth().
2. Have a Bitmapset per relation that indexes which items in
eq_classes have ec_members for the given relation.
3. In match_eclasses_to_foreign_key_col() perform a bms_overlap() on
the eq_classes index bitmapsets for the relations at either side of
the foreign key then perform a bms_next_member() type loop over the
result of that in order to skip over the eq_classes items that can't
match.
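Step 3 would look something like the following single-word sketch,
where next_member() mirrors the contract of bms_next_member(): pass the
previous member (or -1 to start), get the next member back, -1 when
exhausted.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t idxset_t;		/* toy stand-in for a Bitmapset */

/* Return the next set bit above "prev", or -1 when there are none. */
static int
next_member(idxset_t set, int prev)
{
	for (int i = prev + 1; i < 64; i++)
		if (set & ((idxset_t) 1 << i))
			return i;
	return -1;
}

/*
 * Intersect the two relations' eclass-index sets, then visit only the
 * eq_classes entries that survive; everything else can't match the FK.
 */
static int
count_candidate_ecs(idxset_t rel1_ecs, idxset_t rel2_ecs)
{
	idxset_t	both = rel1_ecs & rel2_ecs;
	int			n = 0;

	for (int i = next_member(both, -1); i >= 0; i = next_member(both, i))
		n++;					/* a real version would examine eq_classes[i] */
	return n;
}
```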
Since match_foreign_keys_to_quals() calls
match_eclasses_to_foreign_key_col() once for each item in
PlannerInfo->fkey_list (167815 items in this case) and again for each
foreign key column in those keys, then this should significantly
reduce the effort required since we make a pass over *every*
equivalence class each time match_eclasses_to_foreign_key_col() gets
called.
This array list implementation is something I did want to get done for
PG12. The bar for doing this is pretty high given what was
mentioned in [2].
[1]: /messages/by-id/CAKJS1f_2SnXhPVa6eWjzy2O9A=ocwgd0Cj-LQeWpGtrWqbUSDA@mail.gmail.com
[2]: /messages/by-id/21592.1509632225@sss.pgh.pa.us
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 22 Dec 2018 at 10:53, David Rowley <david.rowley@2ndquadrant.com> wrote:
Back in [1], I mentioned that I'd like to start moving away from our
linked list based implementation of List and start using an array
based version instead. If we had this we could easily further improve
this code fkey matching code to not even look near equivalence classes
that don't contain members for the relations in question with a design
something like:

1. Make PlannerInfo->eq_classes an array list instead of a linked
list, which will significantly improve the performance of list_nth().
2. Have a Bitmapset per relation that indexes which items in
eq_classes have ec_members for the given relation.
3. In match_eclasses_to_foreign_key_col() perform a bms_overlap() on
the eq_classes index bitmapsets for the relations at either side of
the foreign key then perform a bms_next_member() type loop over the
result of that in order to skip over the eq_classes items that can't
match.
Using the above idea, but rather than going to the trouble of storing
PlannerInfo->eq_classes as an array type list, if we build the array
on the fly inside match_foreign_keys_to_quals(), then build a
Bitmapset type index to mark which of the eclasses contains members
for each relation, then I can get the run-time for the function down
to just 0.89%. Looking at other functions appearing high on the
profile, I also see have_relevant_eclass_joinclause() (14%) and
generate_join_implied_equalities_for_ecs() (23%).
I think if we seriously want to improve planning performance when
there are many stored equivalence classes, then we need to have
indexing along the lines of what I've outlined above.
I've attached the patch I used to test this idea. It might be
possible to develop this into something committable, perhaps if we
invent a new function in equivclass.c that builds the index into a
single struct and we pass a pointer to that down to the functions that
require the index. Such a function could also optionally skip
indexing eclasses such as ones with ec_has_volatile or ec_has_const
when the use case for the index requires ignoring such eclasses.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
poc_eclass_indexing_v1.patch (application/octet-stream, +56 -6)
On Mon, 24 Dec 2018 at 09:38, David Rowley <david.rowley@2ndquadrant.com> wrote:
Using the above idea, but rather than going to the trouble of storing
PlannerInfo->eq_classes as an array type list, if we build the array
on the fly inside match_foreign_keys_to_quals(), then build a
Bitmapset type index to mark which of the eclasses contains members
for each relation, then I can get the run-time for the function down
to just 0.89%. Looking at other functions appearing high on the
profile, I also see have_relevant_eclass_joinclause() (14%) and
generate_join_implied_equalities_for_ecs() (23%).
I've now expanded the proof of concept patch to use this indexing
technique for have_relevant_eclass_joinclause() and
generate_join_implied_equalities_for_ecs(). With Tom's test from
up-thread, I get:
Master:
latency average = 14125.374 ms
Patched:
latency average = 2417.164 ms
There are some other cases, such as
generate_implied_equalities_for_column(), that are possibly also
indexable, but in that case, we cannot use ec_relids to help build the
index since it does not record child relation equivalence class
members. That function is appearing at about 3.3% of
total plan time with the patched version of the code, so there's still
some small gains to be had there. The performance of
has_relevant_eclass_joinclause() could likely also be improved from
this indexing. According to my profiling, it's currently still about
2.6% of total planning time with the patched version.
I've attached the updated (rough) proof of concept patch. I ended up
stuffing the equivalence class index structure into PlannerInfo so
that it would be available in all the places it was required, but
build just once higher up the call stack. I don't believe this is the
correct solution for a finished patch, but I didn't really have any
better ideas and this seemed good enough to demonstrate what the
performance could look like.
Other ideas I have to further improve the performance of this query
would be to move the fkey_list out of PlannerInfo and instead include
a per-relation list inside RelOptInfo. This would allow
get_foreign_key_join_selectivity() to just look at foreign keys that
are relevant to the given relations rather than having to skip all
foreign keys that are not. This function is still accounting for about
5.5% of the total planning time for this query. I imagine it wouldn't
hurt match_foreign_keys_to_quals() too much to have it loop over each
RelOptInfo and look at the foreign keys defined on each of those.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
poc_eclass_indexing_v2.patch (application/octet-stream, +202 -9)
On 12/24/18 1:07 PM, David Rowley wrote:
On Mon, 24 Dec 2018 at 09:38, David Rowley <david.rowley@2ndquadrant.com> wrote:
Using the above idea, but rather than going to the trouble of storing
PlannerInfo->eq_classes as an array type list, if we build the array
on the fly inside match_foreign_keys_to_quals(), then build a
Bitmapset type index to mark which of the eclasses contains members
for each relation, then I can get the run-time for the function down
to just 0.89%. Looking at other functions appearing high on the
profile, I also see have_relevant_eclass_joinclause() (14%) and
generate_join_implied_equalities_for_ecs() (23%).

I've now expanded the proof of concept patch to use this indexing
technique for have_relevant_eclass_joinclause() and
generate_join_implied_equalities_for_ecs(). With Tom's test from
up-thread, I get:

Master:
latency average = 14125.374 ms

Patched:
latency average = 2417.164 ms
Yes, I can confirm these measurements. On my machine, timing on master
is about 10530ms, with v1 of the patch it drops to 2600ms and v2 pushes
it down to 1610ms.
I however observe failures on 4 regression test suites - inherit,
equivclass, partition_join and partition_prune (diff attached). That's a
bit surprising, because AFAICS the patch merely optimizes the execution
and should not change the planning otherwise (all the failures seem to
be a query plan changing in some way). I'm not sure if the plans are
correct or better than the old ones.
The other thing is that we probably should not use a single test case to
measure the optimization - I doubt it can improve less extreme queries,
but maybe we should verify it does not regress them?
With the patch attached, bms_overlap gets quite high in the profiles. I
think one reason for that is that all bitmap operations (not just
bms_overlap) always start the loop at 0. The upper boundary is
usually determined as the Min() of the lengths, but we don't do that
for the lower boundary because we don't track that. The bitmaps for
eclasses are likely sparse, so this is quite painful. Attached is a
simple patch that adds tracking of "minword" and uses it in various
bms_ methods to skip the initial part of the loop. On my machine this
reduces the timings by roughly 5% (from 1610 to 1530 ms).
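For illustration, a fixed-size sketch of the minword idea (real
Bitmapsets are variable-length, so the upper bound would come from the
word counts as well):

```c
#include <assert.h>
#include <stdint.h>

#define NWORDS 4				/* toy fixed-size bitmap */

typedef struct bitmap
{
	int			minword;		/* index of the lowest nonzero word */
	uint64_t	words[NWORDS];
} bitmap;

/*
 * Overlap test that skips the leading run of empty words.  Any word
 * below a set's minword is all-zero, so starting at the larger of the
 * two minwords cannot miss an overlapping bit.
 */
static int
bitmap_overlap(const bitmap *a, const bitmap *b)
{
	int			start = (a->minword > b->minword) ? a->minword : b->minword;

	for (int i = start; i < NWORDS; i++)
		if (a->words[i] & b->words[i])
			return 1;
	return 0;
}
```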
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 25 Dec 2018 at 13:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I however observe failures on 4 regression test suites - inherit,
equivclass, partition_join and partition_prune (diff attached). That's a
bit surprising, because AFAICS the patch merely optimizes the execution
and should not change the planning otherwise (all the failures seem to
be a query plan changing in some way). I'm not sure if the plans are
correct or better than the old ones.
Seems I didn't run the tests after doing a last-minute move of the
create_eclass_index() call in make_one_rel(). I'd previously had it
just below set_base_rel_pathlists(), which meant that the index didn't
exist in some cases so it would fall back on the original code. It
appears the calls from set_base_rel_pathlists() require
is_child eclass members to be indexed too, but these were not in v1 or
v2 since ec_relids does not record child rel members.
It seems simple enough to fix this just by adding ec_allrelids and
setting it for all members rather than just non-child members.
This also allows me to add the two additional cases to allow
generate_implied_equalities_for_column() and
has_relevant_eclass_joinclause() to also make use of the eclass index.
This further reduces the total planning time for the query on my
machine to 2304 ms.
The other thing is that we probably should not use a single test case to
measure the optimization - I doubt it can improve less extreme queries,
but maybe we should verify it does not regress them?
Agreed. That still needs to be verified, although I'm as yet unsure
what form we could use this idea in. I wonder if it's fine to
create this eclass index on the fly, or if it's something we should
keep maintained all the time. The problem I saw with keeping it all
the time is that eq_classes is a List and list_nth() is slow,
which means the bms_next_member() loops I've added would perform
poorly when combined with a linked list lookup.
With the patch attached, bms_overlap gets quite high in the profiles. I
think one reason for that is that all bitmap operations (not just
bms_overlap) always start the loop at 0. The upper boundary is
usually determined as the Min() of the lengths, but we don't do that
for the lower boundary because we don't track that. The bitmaps for
eclasses are likely sparse, so this is quite painful. Attached is a
simple patch that adds tracking of "minword" and uses it in various
bms_ methods to skip the initial part of the loop. On my machine this
reduces the timings by roughly 5% (from 1610 to 1530 ms).
Seems like an interesting idea, although for the most part, Bitmapsets
are small and this adds some overhead to all use cases. I doubt it is
worth the trouble for bms_is_member(), since that does not need to
perform a loop over each bitmapword. It looks like most of the
bms_overlap() usages are caused by functions like
have_join_order_restriction(), join_is_legal(),
check_outerjoin_delay() and add_paths_to_joinrel(), all of which are
performing a loop over the join_info_list. Perhaps that could be
indexed a similar way to the eq_classes List. I'd imagine there's
about another 20-30% of performance to squeeze out of this by doing
that, plus a bit more by making the fkey_list per RelOptInfo.
Attached is v3 of the hacked up proof of concept performance demo patch.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
poc_eclass_indexing_v3.patch (application/octet-stream, +331 -9)
On 12/25/18 3:48 AM, David Rowley wrote:
On Tue, 25 Dec 2018 at 13:46, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
I however observe failures on 4 regression test suites - inherit,
equivclass, partition_join and partition_prune (diff attached). That's a
bit surprising, because AFAICS the patch merely optimizes the execution
and should not change the planning otherwise (all the failures seem to
be a query plan changing in some way). I'm not sure if the plans are
correct or better than the old ones.

Seems I didn't run the tests after doing a last-minute move of the
create_eclass_index() call in make_one_rel(). I'd previously had it
just below set_base_rel_pathlists(), which meant that the index didn't
exist in some cases so it would fall back on the original code. It
appears the calls from set_base_rel_pathlists() require
is_child eclass members to be indexed too, but these were not in v1 or
v2 since ec_relids does not record child rel members.

It seems simple enough to fix this just by adding ec_allrelids and
setting it for all members rather than just non-child members.
This also allows me to add the two additional cases to allow
generate_implied_equalities_for_column() and
has_relevant_eclass_joinclause() to also make use of the eclass index.
This further reduces the total planning time for the query on my
machine to 2304 ms.
OK, that makes sense.
The other thing is that we probably should not use a single test case to
measure the optimization - I doubt it can improve less extreme queries,
but maybe we should verify it does not regress them?

Agreed. That still needs to be verified, although I'm as yet unsure
what form we could use this idea in. I wonder if it's fine to
create this eclass index on the fly, or if it's something we should
keep maintained all the time. The problem I saw with keeping it all
the time is that eq_classes is a List and list_nth() is slow,
which means the bms_next_member() loops I've added would perform
poorly when combined with a linked list lookup.
Yeah, good questions. I think the simplest thing we could do is building
them on the first access - that would at least ensure we don't build the
index without accessing it at least once.
Of course, it might still be faster to do the check directly, for small
numbers of eclasses / fkeys and rels. Perhaps there's some sort of
heuristics deciding when to build the indexes, but that seems like an
overkill at this point.
What I propose is constructing a "minimal" simple query invoking this
code (something like two rels with a join on a foreign key) and measure
impact on that. That seems like a fairly realistic use case.
Also, I wonder what the goal is here - how much do we need to reduce
the duration to be happy? Initially I'd say "on par with the state
before the FK patch" but I guess we're already optimizing beyond that
point. What was the timing before adding the FK stuff?
With the patch attached, bms_overlap gets quite high in the profiles. I
think one reason for that is that all bitmap operations (not just
bms_overlap) always start the loop at 0. The upper boundary is
usually determined as the Min() of the lengths, but we don't do that
for the lower boundary because we don't track that. The bitmaps for
eclasses are likely sparse, so this is quite painful. Attached is a
simple patch that adds tracking of "minword" and uses it in various
bms_ methods to skip the initial part of the loop. On my machine this
reduces the timings by roughly 5% (from 1610 to 1530 ms).

Seems like an interesting idea, although for the most part, Bitmapsets
are small and this adds some overhead to all use cases. I doubt it is
worth the trouble for bms_is_member(), since that does not need to
perform a loop over each bitmapword.
Perhaps. The bitmapsets are generally small in number of members, but
the values may be actually quite high in some cases (I don't have any
exact statistics, though).
You're right it adds a bit of overhead, but I'd expect that to be
outweighed by the reduction of cache misses. But 5% is fairly close to
noise.
It looks like most of the bms_overlap() usages are caused by
functions like have_join_order_restriction(), join_is_legal(),
check_outerjoin_delay() and add_paths_to_joinrel(), all of which are
performing a loop over the join_info_list. Perhaps that could be
indexed a similar way to the eq_classes List. I'd imagine there's
about another 20-30% of performance to squeeze out of this by doing
that, plus a bit more by making the fkey_list per RelOptInfo.
Looks promising.
Attached is v3 of the hacked up proof of concept performance demo patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 26 Dec 2018 at 09:50, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Yeah, good questions. I think the simplest thing we could do is building
them on the first access - that would at least ensure we don't build the
index without accessing it at least once.
I think we first need to focus on what is back-patchable here. The
problem I see with the equivalence class index idea is that it would
require passing the index down into
match_eclasses_to_foreign_key_col() which is not a static function, so
we can't really go changing its signature on a backbranch.
Another idea would be to create a new version of
match_eclasses_to_foreign_key_col() which uses the index, which would
mean we'd not break any extensions that might happen to use
match_eclasses_to_foreign_key_col().
Ideally, the quick fix in the v1 patch would be good enough for the
backbranches, but a quick bit of benchmarking shows that there's still
a big regression to what the performance is like without the foreign
keys.
(Average of EXPLAIN over 60 seconds)
foreign key qual matching code commented out: 2486.204 ms
Master: 13909.551 ms
v1 patch: 7310.719 ms
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 22 Dec 2018 at 09:28, David Rowley <david.rowley@2ndquadrant.com> wrote:
On Fri, 21 Dec 2018 at 06:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I was distressed to discover via perf that 69% of the runtime of this
test now goes into match_eclasses_to_foreign_key_col(). That seems
clearly unacceptable.

Agreed. That's pretty terrible.
I looked at this a bit and see that
match_eclasses_to_foreign_key_col() is missing any smarts to skip
equivalence classes that don't have ec_relids bits for both rels. With
that added the run-time is reduced pretty dramatically.

I've only tested with a debug build as of now, but I get:
Unpatched:
$ pgbench -n -T 60 -f query.sql postgres
latency average = 18411.604 ms

Patched:
latency average = 8748.177 ms
So that this does not get lost, I've added an entry for the original
patch for the March commitfest.
While the patch does not bring the performance back to what it was
before this code was added, it makes a massive dent in the additional
overhead.
https://commitfest.postgresql.org/22/1984/
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, 5 Feb 2019 at 22:43, David Rowley <david.rowley@2ndquadrant.com> wrote:
So that this does not get lost, I've added an entry for the original
patch for the March commitfest.
Attaching the original patch again so the commitfest bot gets off my
back about the other one not applying.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
speed_up_matching_fkeys_to_eclasses.patch (application/octet-stream, +5 -0)
David Rowley <david.rowley@2ndquadrant.com> writes:
Attaching the original patch again so the commitfest bot gets off my
back about the other one not applying.
Pushed that one. I'm interested by the "POC" patch, but I agree
that it'd take some research to show that it isn't a net negative
for simple queries. It sounds like you're not really interested
in pursuing that right now?
Anyway, I rebased the POC patch up to HEAD, just in case anyone
still wants to play with it.
regards, tom lane
Attachments:
poc_eclass_indexing_v4.patch (text/x-diff, +330 -17)
On Thu, 21 Feb 2019 at 15:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pushed that one. I'm interested by the "POC" patch, but I agree
that it'd take some research to show that it isn't a net negative
for simple queries. It sounds like you're not really interested
in pursuing that right now?
Thanks for pushing it.
I'm still interested in the POC patch. I just knew it wasn't something
for the back branches and thought something that was would be more
important... Things having gone quiet here wasn't a good source of
inspiration to do any further work on it. It's good to hear you're
interested.
Anyway, I rebased the POC patch up to HEAD, just in case anyone
still wants to play with it.
Cool. Thanks.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
On Thu, 21 Feb 2019 at 15:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anyway, I rebased the POC patch up to HEAD, just in case anyone
still wants to play with it.
Cool. Thanks.
I haven't done any of the performance testing that this patch
clearly needs, but just in a quick look through the code:
* I seriously dislike the usage of root->eq_classes, primarily the
underdocumented way that it means one thing in some phases of the
planner and something else in other phases. You seem to be doing that
in hopes of saving some memory, but is the index data structure really
large enough to matter? This scheme is certainly unlikely to continue
to work if we add any additional uses of EquivalenceClassIndexes.
So I think it might be better to pass them around as explicit
arguments and/or attach them someplace else than PlannerInfo.
* I'm also not very excited about having both a fast and slow path
in places like has_relevant_eclass_joinclause() depending on whether
the index exists or not. IMO, if we're going to do this at all,
we should just go over to requiring the index to exist when needed,
and get rid of the slow paths. (A possible variant to that is "build
the index structure on-demand", though you'd have to be careful about
GEQO memory management.) Otherwise we'll forever be fighting hidden
planner-performance bugs of the form "if you call function xyz from
here, it'll be way slower than you expected".
* There's not much point in making EquivalenceClassIndex a Node type
if you're not going to wire it up to any of the Node infrastructure
(particularly outfuncs.c, which might be useful for debug purposes).
But I'm not really sure that it ought to be a distinct data structure
at all --- maybe this requirement should be more tightly integrated
with the ECs themselves? One idea of what that might look like is to
let each base RelOptInfo have a list of associated EquivalenceClasses,
so that you'd only have to search through directly-relevant ECs when
trying to prove something. But I'm just handwaving here.
* Spell check please, particularly EQUIVALANCE.
* Documentation of the data structure is pretty weak, eg what is
"this relation" in the comment about ei_index? And how do you
know how long the arrays are, or what they're indexed by? And
there's no explicit statement that ei_flags is a bitmask of those
other symbols, much less any real statement of what each flag means.
Setting the CF entry to WOA for now. I wonder though if we should
just push it out to v13 immediately --- are you intending to do more
with it in the near future?
regards, tom lane
On Sat, 9 Mar 2019 at 08:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Setting the CF entry to WOA for now. I wonder though if we should
just push it out to v13 immediately --- are you intending to do more
with it in the near future?
Thanks a lot for going over this and providing feedback. I put the
patch out there mostly to see if such a thing is something we might
want, and it's good to see no objections to the idea. I didn't want to
waste too much time if there was going to be some serious objections
to the idea.
This is something I'd like to look into for v13. I think it'll be
much easier to do if we can get your List reimplementation patch in
first. That's going to allow list_nth on PlannerInfo->eq_classes to
work quickly and will remove the need for having to build an array to
store the classes, and as you mention, RelOptInfo could store a
Bitmapset to store which ECs have members for this rel. I've
modified the CF entry to target v13.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, 9 Mar 2019 at 13:18, David Rowley <david.rowley@2ndquadrant.com> wrote:
This is something I'd like to look into for v13. I think it'll be
much easier to do if we can get your List reimplementation patch in
first. That's going to allow list_nth on PlannerInfo->eq_classes to
work quickly and will remove the need for having to build an array to
store the classes, and as you mention, RelOptInfo could store a
Bitmapset to store which ECs have members for this rel. I've
modified the CF entry to target v13.
I started looking at this again and I came up with the attached
eclass_indexes_v2.patch. This modifies the original patch
significantly so that we no longer build this big eclass index when we
think we might start to need it. Instead, I've added a Bitmapset
field to RelOptInfo named "eclass_indexes". During add_eq_member() we
just note down the PlannerInfo->eq_classes index of the eclass we're
adding to in each RelOptInfo->eclass_indexes that's involved in the
new eclass member being added. This means we can use the Bitmapset
lookup anytime we like. That allowed me to get rid of those special
cases I had added to make use of the index when it exists. I've now
modified all the existing code to make use of the
RelOptInfo->eclass_indexes field.
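The bookkeeping described above can be sketched like this (single-word
sets stand in for real Bitmapsets, and an array indexed by relid stands
in for the field each RelOptInfo would carry):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RELS 8

typedef uint64_t idxset_t;		/* toy stand-in for a Bitmapset */

/* One eclass_indexes set per relation, as RelOptInfo would carry. */
static idxset_t rel_eclass_indexes[MAX_RELS];

/*
 * Called when a member joins eclass number "ec_index": record that
 * index in the set of every relation the new member mentions, so
 * later lookups can go straight to the relevant eclasses.
 */
static void
note_eclass_member(int ec_index, const int *member_relids, int nrels)
{
	for (int i = 0; i < nrels; i++)
		rel_eclass_indexes[member_relids[i]] |= (idxset_t) 1 << ec_index;
}
```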
As mentioned above, my thoughts on this patch were that this new
method would only be any good once we got Tom's list implementation
patch as that makes list_nth() O(1) instead of O(N). On benchmarking
this I was quite surprised that it still improves performance quite a
bit even without the list reimplementation patch.
Using Tom's test case [1], I get the following:
* master (82a5649fb)
latency average = 6992.320 ms
latency average = 6990.297 ms
latency average = 7030.619 ms
* master + eclass_indexes_v2
latency average = 2537.536 ms
latency average = 2530.824 ms
latency average = 2543.770 ms
If I add in Tom's list reimplementation too, I get:
* master + Tom's list reimplementation v3 + eclass_indexes
latency average = 1225.910 ms
latency average = 1209.565 ms
latency average = 1207.326 ms
I really didn't expect just this patch alone to speed this up as it
calls list_nth() in a loop. I can only imagine that with this
workload that these list_nth() loops are not performing many loops,
otherwise, the performance would die off quickly. I thought about how
we could defend against workloads where there's a large number of
eclasses and most match a given relation. I ended up with a small
function named list_skip_forward() that simply loops N times,
advancing N ListCells forward from the given ListCell. This does improve
performance a little bit more, but probably, more importantly, should
be a good defence against the worst case. As is, the function would
become completely redundant with Tom's list reimplementation patch, so
for now, I just snuck it in as a static function in equivclass.c.
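Standalone, the helper behaves like this (a bare singly linked cell
instead of PostgreSQL's ListCell, just to show the behaviour):

```c
#include <assert.h>
#include <stddef.h>

typedef struct cell
{
	int			value;
	struct cell *next;
} cell;

/*
 * Advance n cells forward from lc, stopping early (returning NULL)
 * if the list runs out first.
 */
static cell *
list_skip_forward(cell *lc, int n)
{
	while (n-- > 0 && lc != NULL)
		lc = lc->next;
	return lc;
}
```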
I've attached this list_skip_forward.patch too. This patch is a bit
rough around the edges as I only just started playing with it in the
past 30 mins or so. Perhaps it shows that we might be able to fix
this in PG12 and not have to wait for the list reimplementation at
all.
The performance with both attached patches is:
* master + eclass_indexes + list_skip_forward_v0
latency average = 2375.383 ms
latency average = 2351.450 ms
latency average = 2339.259 ms
Still nowhere near as good as with the list reimplementation patch,
but way better than master.
[1]: /messages/by-id/6970.1545327857@sss.pgh.pa.us
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, 10 Mar 2019 at 21:34, David Rowley <david.rowley@2ndquadrant.com> wrote:
I started looking at this again and I came up with the attached
eclass_indexes_v2.patch.
The CF bot didn't like v2. It warned about an uninitialized variable.
My compiler didn't.
Here's v3, hopefully with that fixed.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
eclass_indexes_v3.patch (application/octet-stream, +232 -66)
David Rowley <david.rowley@2ndquadrant.com> writes:
[ eclass_indexes_v3.patch ]
I looked at this for a little bit. I'm on board with the basic idea
of assigning integer indexes to the ECs and keeping bitmapsets of
relevant EC indexes in RelOptInfos. However ... man, is that
delete_eclass() thing ugly. And slow, and fragile-feeling.
I think that maybe we could improve that situation by not trying to
maintain the RelOptInfo.eclass_indexes sets at all during the initial
construction of the ECs, but only setting them up after we've completed
doing that. We already have an assumption that we know when EC merging
is done and it's safe to make pathkeys that reference the "canonical"
(merged) ECs, so it seems like that would be a point where we could
build the indexing bitmapsets safely. This hinges on not needing the
bitmapsets till after that point, but it'd be easy to arrange some
Asserts verifying that we don't try to use them before that.
If that doesn't work (because we need the index info sooner), maybe we
could consider never removing ECs from eq_classes, so that their indices
never change, then teaching most/all of the loops over eq_classes to
ignore entries with non-null ec_merged. But we probably want the indexing
bitmapsets to reflect canonical EC numbers, so we'd still have to do some
updating I fear -- or else be prepared to chase up the ec_merged links
when using the index bitmaps.
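The link-chasing described above can be sketched in a few lines. This is a
cut-down stand-in, not the real struct: only an ec_merged pointer and a
hypothetical stable ec_index are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down EquivalenceClass: only the merge link and an index matter here. */
typedef struct EquivalenceClass
{
    struct EquivalenceClass *ec_merged; /* EC this one was merged into, or NULL */
    int         ec_index;               /* hypothetical stable position in eq_classes */
} EquivalenceClass;

/*
 * If entries are never removed from eq_classes, an indexing bitmapset may
 * still point at an EC that has since been merged away; users of the
 * bitmap then chase the ec_merged links to reach the canonical EC.
 */
static EquivalenceClass *
canonical_eclass(EquivalenceClass *ec)
{
    while (ec->ec_merged != NULL)
        ec = ec->ec_merged;
    return ec;
}

/* Build a merge chain a -> b -> c and report c's index via the chase. */
static int
merge_chain_demo(void)
{
    EquivalenceClass a = {NULL, 0};
    EquivalenceClass b = {NULL, 1};
    EquivalenceClass c = {NULL, 7};

    a.ec_merged = &b;
    b.ec_merged = &c;
    return canonical_eclass(&a)->ec_index;
}
```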
Stepping back a little bit, I wonder whether now is the time to rethink
the overall EC data structure, as foreseen in this old comment:
* Note: constructing merged EquivalenceClasses is a standard UNION-FIND
* problem, for which there exist better data structures than simple lists.
* If this code ever proves to be a bottleneck then it could be sped up ---
* but for now, simple is beautiful.
I'm not very sure what a better data structure would really look like ---
after perusing Sedgewick's section on UNION-FIND, it seems like the
ec_merged links are more or less what he's recommending anyway, and the
expensive part of this is probably all the equal() tests to find whether
two proposed EC member expressions are already in the set of known
expressions. So I'm just handwaving here. But my point is that we
needn't feel wedded to the idea of keeping the ECs in a list, if we can
think of some better data structure for them.
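For reference, the classic array-based UNION-FIND the comment alludes to
looks roughly like this. It's a generic textbook sketch with path halving,
not PostgreSQL code; as noted above, the ec_merged links already play this
role, and the expensive part is the equal() membership tests rather than
the set-merging structure itself.

```c
#include <assert.h>

#define UF_NELEM 16

static int uf_parent[UF_NELEM];

/* Start with every element in its own singleton set. */
static void
uf_init(void)
{
    for (int i = 0; i < UF_NELEM; i++)
        uf_parent[i] = i;
}

/* Find the set representative, halving the path as we go. */
static int
uf_find(int x)
{
    while (uf_parent[x] != x)
    {
        uf_parent[x] = uf_parent[uf_parent[x]];
        x = uf_parent[x];
    }
    return x;
}

/* Merge the sets containing x and y. */
static void
uf_union(int x, int y)
{
    uf_parent[uf_find(x)] = uf_find(y);
}

/* Merge {1,2} and {2,3}; 1 and 3 end up together, 4 stays apart. */
static int
uf_demo(void)
{
    uf_init();
    uf_union(1, 2);
    uf_union(2, 3);
    return uf_find(1) == uf_find(3) && uf_find(1) != uf_find(4);
}
```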
regards, tom lane
Thanks for looking at this.
On Mon, 18 Mar 2019 at 10:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I looked at this for a little bit. I'm on board with the basic idea
of assigning integer indexes to the ECs and keeping bitmapsets of
relevant EC indexes in RelOptInfos. However ... man, is that
delete_eclass() thing ugly. And slow, and fragile-feeling.
Yeah. When I discovered we remove eclasses from the list I was a
little annoyed as the code was pretty simple before having to account
for that. I'll grant you ugly and slow, but I don't quite see the
fragile part.
I think that maybe we could improve that situation by not trying to
maintain the RelOptInfo.eclass_indexes sets at all during the initial
construction of the ECs, but only setting them up after we've completed
doing that. We already have an assumption that we know when EC merging
is done and it's safe to make pathkeys that reference the "canonical"
(merged) ECs, so it seems like that would be a point where we could
build the indexing bitmapsets safely. This hinges on not needing the
bitmapsets till after that point, but it'd be easy to arrange some
Asserts verifying that we don't try to use them before that.
I don't think that's so easily workable. Consider, for
example, add_child_rel_equivalences(): it's called well
after the point I think you have in mind. That function
should probably be modified to use the indexes, but it must also
update them.
If that doesn't work (because we need the index info sooner), maybe we
could consider never removing ECs from eq_classes, so that their indices
never change, then teaching most/all of the loops over eq_classes to
ignore entries with non-null ec_merged. But we probably want the indexing
bitmapsets to reflect canonical EC numbers, so we'd still have to do some
updating I fear -- or else be prepared to chase up the ec_merged links
when using the index bitmaps.
That's probably a better solution. Perhaps we can continue to nullify
the ec_members, then just skip eclasses with a NIL ec_members. I
avoided that in the patch because I was worried about what extensions
might be doing, but if you think it's okay, then I can change the
patch.
Stepping back a little bit, I wonder whether now is the time to rethink
the overall EC data structure, as foreseen in this old comment:
* Note: constructing merged EquivalenceClasses is a standard UNION-FIND
* problem, for which there exist better data structures than simple lists.
* If this code ever proves to be a bottleneck then it could be sped up ---
* but for now, simple is beautiful.
I've thought for a while now that it wouldn't be too hard to have
equal(), or some version of it, return -1, 0, or 1, which would allow
us to binary search or to build a binary search tree of nodes. I
imagine it wouldn't be too hard to create a compact binary search tree
inside an array, with indexes to the left and right subtrees instead
of pointers. That would likely be a bit more cache friendly and would
allow simple non-recursive traversals of the entire tree. However,
that does not sound like something this patch should be doing.
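A rough standalone sketch of such an array-embedded tree, with integer
keys standing in for EC member expressions and "<" standing in for a
hypothetical three-way variant of equal() returning -1/0/1:

```c
#include <assert.h>

#define MAX_NODES 64

/*
 * One node of a compact binary search tree stored in an array; child
 * links are array indexes rather than pointers, with -1 meaning
 * "no child".
 */
typedef struct BstNode
{
    int         key;
    int         left;
    int         right;
} BstNode;

static BstNode bst_nodes[MAX_NODES];
static int  bst_count = 0;

/* Insert key if absent; either way, return the index of its node. */
static int
bst_insert(int key)
{
    int         i = 0;

    if (bst_count == 0)
    {
        bst_nodes[0] = (BstNode) {key, -1, -1};
        return bst_count++;
    }
    for (;;)
    {
        int        *child;

        if (key == bst_nodes[i].key)
            return i;           /* already present: no duplicate entry */
        child = (key < bst_nodes[i].key) ? &bst_nodes[i].left : &bst_nodes[i].right;
        if (*child == -1)
        {
            bst_nodes[bst_count] = (BstNode) {key, -1, -1};
            *child = bst_count;
            return bst_count++;
        }
        i = *child;
    }
}

/* Insert 5, 3, 8, then 3 again: the repeat finds the existing node. */
static int
bst_demo(void)
{
    int         first;

    bst_insert(5);
    first = bst_insert(3);
    bst_insert(8);
    return bst_insert(3) == first;
}
```

Because the nodes live in one contiguous array and the traversal is a
simple loop, this avoids both pointer-chasing across the heap and
recursion, which is the cache-friendliness point made above.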
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, 18 Mar 2019 at 14:06, David Rowley <david.rowley@2ndquadrant.com> wrote:
On Mon, 18 Mar 2019 at 10:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
If that doesn't work (because we need the index info sooner), maybe we
could consider never removing ECs from eq_classes, so that their indices
never change, then teaching most/all of the loops over eq_classes to
ignore entries with non-null ec_merged. But we probably want the indexing
bitmapsets to reflect canonical EC numbers, so we'd still have to do some
updating I fear -- or else be prepared to chase up the ec_merged links
when using the index bitmaps.
That's probably a better solution. Perhaps we can continue to nullify
the ec_members, then just skip eclasses with a NIL ec_members. I
avoided that in the patch because I was worried about what extensions
might be doing, but if you think it's okay, then I can change the
patch.
I've modified the patch to do it this way.
The only loop over eq_classes I saw outside of equivclass.c was in
postgres_fdw.c. This just calls eclass_useful_for_merging() on each
EC in the list. Instead of having that loop skip deleted ECs, I
changed eclass_useful_for_merging() so that it just returns false for
that case.
The only other thing I changed was to create a new function named
get_common_eclass_indexes(), which removes some duplicate code where
we were getting the ECs common to two relations. I also made it so this
function does not allocate unnecessary Bitmapsets when the inputs are
simple relations.
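The intersection idea behind that function can be sketched with a toy
one-word bitmapset. Everything here is a simplified stand-in: the real
function deals in variable-length Bitmapsets and full RelOptInfos.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy one-word "bitmapset" of eclass indexes; PostgreSQL's Bitmapset is
 * variable length, but a single word is enough to show the idea.
 */
typedef uint64_t EclassSet;

/* Cut-down RelOptInfo carrying only the hypothetical eclass_indexes set. */
typedef struct RelOptInfo
{
    EclassSet   eclass_indexes; /* indexes of ECs mentioning this relation */
} RelOptInfo;

/*
 * Sketch of get_common_eclass_indexes(): the ECs possibly relevant to a
 * join of rel1 and rel2 are exactly the intersection of the two
 * per-relation index sets, so only those positions in eq_classes need
 * visiting instead of the whole list.
 */
static EclassSet
get_common_eclass_indexes(const RelOptInfo *rel1, const RelOptInfo *rel2)
{
    return rel1->eclass_indexes & rel2->eclass_indexes;
}
```

With 1005 eclasses but only a handful touching any given pair of rels,
scanning just the set bits of the intersection is where the win comes from.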
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services