Performance regression with PostgreSQL 11 and partitioning

Started by Thomas Reissabout 8 years ago41 messageshackers

thomas.reiss@dalibo.com

about 8 years ago

Hello,

I spent some time to test the new features on partitioning with the
beta1. I noticed a potentially huge performance regression with
plan-time partition pruning.

To show the issue, I used this DO statement to generate some partitions,
one per day :
DO $$
DECLARE
part_date date;
ddl text;
BEGIN
CREATE TABLE t1 (
num INTEGER NOT NULL,
dt DATE NOT NULL
) PARTITION BY LIST (dt);

FOR part_date IN SELECT d FROM generate_series(date '2010-01-01',
'2020-12-31', interval '1 day') d LOOP
ddl := 'CREATE TABLE t1_' || to_char(part_date, 'YYYY_MM_DD') || E'
PARTITION OF t1 FOR VALUES IN (\'' || part_date || E'\')';
EXECUTE ddl;
END LOOP;
END;
$$;

Then I used the following to compare the planning time :
explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';

With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
143ms. I also did a little test with more than 20k partitions, and while
the planning time was reasonable with PG10 (287.453 ms), it exploded
with v11 with 4578.054 ms.

Perf showed that thes functions find_appinfos_by_relids and
bms_is_member consumes most of the CPU time with v11. With v10, this
functions don't appear. It seems that find_appinfos_by_relids was
introduced by commit 480f1f4329f.

Regards,
Thomas

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Thomas Reiss (#1)

Re: Performance regression with PostgreSQL 11 and partitioning

On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:

Then I used the following to compare the planning time :
explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';

With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
143ms. I also did a little test with more than 20k partitions, and while
the planning time was reasonable with PG10 (287.453 ms), it exploded
with v11 with 4578.054 ms.

Perf showed that thes functions find_appinfos_by_relids and
bms_is_member consumes most of the CPU time with v11. With v10, this
functions don't appear. It seems that find_appinfos_by_relids was
introduced by commit 480f1f4329f.

Hmm. Have you verified whether that commit is actually the one that
caused the regression? It's certainly possible, but I wouldn't expect
calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
more expensive than calling find_childrel_appendrelinfo() as the
previous code did. I wonder if some later change, perhaps related to
pruning, just caused this code path to be hit more often.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Thomas Reiss

thomas.reiss@dalibo.com

about 8 years ago

In reply to: Robert Haas (#2)

Re: Performance regression with PostgreSQL 11 and partitioning

Le 25/05/2018 à 16:49, Robert Haas a écrit :

On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:

Then I used the following to compare the planning time :
explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';

With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
143ms. I also did a little test with more than 20k partitions, and while
the planning time was reasonable with PG10 (287.453 ms), it exploded
with v11 with 4578.054 ms.

Perf showed that thes functions find_appinfos_by_relids and
bms_is_member consumes most of the CPU time with v11. With v10, this
functions don't appear. It seems that find_appinfos_by_relids was
introduced by commit 480f1f4329f.

Hmm. Have you verified whether that commit is actually the one that
caused the regression? It's certainly possible, but I wouldn't expect
calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
more expensive than calling find_childrel_appendrelinfo() as the
previous code did. I wonder if some later change, perhaps related to
pruning, just caused this code path to be hit more often.

I didn't because I didn't enough time. I'll take another look next week.

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

about 8 years ago

In reply to: Robert Haas (#2)

Re: Performance regression with PostgreSQL 11 and partitioning

On Fri, May 25, 2018 at 11:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 25, 2018 at 10:30 AM, Thomas Reiss <thomas.reiss@dalibo.com> wrote:

Then I used the following to compare the planning time :
explain (analyze) SELECT * FROM t1 WHERE dt = '2018-05-25';

With PostgreSQL 10, planning time is 66ms, in v11, planning rise to
143ms. I also did a little test with more than 20k partitions, and while
the planning time was reasonable with PG10 (287.453 ms), it exploded
with v11 with 4578.054 ms.

Perf showed that thes functions find_appinfos_by_relids and
bms_is_member consumes most of the CPU time with v11. With v10, this
functions don't appear. It seems that find_appinfos_by_relids was
introduced by commit 480f1f4329f.

Hmm. Have you verified whether that commit is actually the one that
caused the regression? It's certainly possible, but I wouldn't expect
calling find_appinfos_by_relids() with 1 AppendRelInfo to be too much
more expensive than calling find_childrel_appendrelinfo() as the
previous code did. I wonder if some later change, perhaps related to
pruning, just caused this code path to be hit more often.

One more possibility is that find_appinfos_by_relids being shown high
up in profiles is called from apply_scanjoin_target_to_paths that's
new in PG 11, which in turn is called (unconditionally) from
grouping_planner. Especially, the following bit in it:

/*
* If the relation is partitioned, recurseively apply the same changes to
* all partitions and generate new Append paths. Since Append is not
* projection-capable, that might save a separate Result node, and it also
* is important for partitionwise aggregate.
*/
if (rel->part_scheme && rel->part_rels)
{
int partition_idx;
List *live_children = NIL;

/* Adjust each partition. */
for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
{
RelOptInfo *child_rel = rel->part_rels[partition_idx];
ListCell *lc;
AppendRelInfo **appinfos;
int nappinfos;
List *child_scanjoin_targets = NIL;

/* Translate scan/join targets for this child. */
appinfos = find_appinfos_by_relids(root, child_rel->relids,
&nappinfos);

Seems here that we call find_appinfos_by_relids here for *all*
partitions, even if all but one partition may have been pruned. I
haven't studied this code in detail, but I suspect it might be
unnecessary, although I might be wrong.

Fwiw, I'm not sure why the new pruning code would call here, at least
git grep find_appinfos_by_relids doesn't turn up anything interesting
in that regard.

Thanks,
Amit

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Amit Langote (#4)

Re: Performance regression with PostgreSQL 11 and partitioning

On Fri, May 25, 2018 at 1:53 PM, Amit Langote <amitlangote09@gmail.com> wrote:

Seems here that we call find_appinfos_by_relids here for *all*
partitions, even if all but one partition may have been pruned. I
haven't studied this code in detail, but I suspect it might be
unnecessary, although I might be wrong.

Uggh. It might be possible to skip it for dummy children. That would
leave the dummy child rels generating a different pathtarget than the
non-dummy children, but I guess if we never use them again maybe it
wouldn't matter.

Fwiw, I'm not sure why the new pruning code would call here, at least
git grep find_appinfos_by_relids doesn't turn up anything interesting
in that regard.

Hmm, OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tom Lane

tgl@sss.pgh.pa.us

about 8 years ago

In reply to: Robert Haas (#5)

Re: Performance regression with PostgreSQL 11 and partitioning

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, May 25, 2018 at 1:53 PM, Amit Langote <amitlangote09@gmail.com> wrote:

Seems here that we call find_appinfos_by_relids here for *all*
partitions, even if all but one partition may have been pruned. I
haven't studied this code in detail, but I suspect it might be
unnecessary, although I might be wrong.

Uggh. It might be possible to skip it for dummy children. That would
leave the dummy child rels generating a different pathtarget than the
non-dummy children, but I guess if we never use them again maybe it
wouldn't matter.

I think this line of thought is shooting the messenger. The core of the
problem here is that the appinfo list, which is a data structure designed
with the expectation of having a few dozen entries at most, has got four
thousand entries (or twenty thousand if you try Thomas' bigger case).
What we need, if we're going to cope with that, is something smarter than
linear searches.

I'm not exactly impressed with the design concept of
find_appinfos_by_relids in itself, either, as what that does is to
linearly search the appinfo list to produce ... an array, which also
has to be linearly searched. In this particular example, it seems
that all the output arrays have length 1; if they did not then we'd
be seeing the search loops in adjust_appendrel_attrs et al taking
up unreasonable amounts of time as well. (Not to mention the palloc
cycles and unreclaimed space eaten by said arrays.)

I'm inclined to think that we should flush find_appinfos_by_relids
altogether, along with these inefficient intermediate arrays, and instead
have the relevant places in adjust_appendrel_attrs call some function
defined as "gimme the AppendRelInfo for child rel A and parent rel B,
working directly from the PlannerInfo root data". That could use a hash
lookup when dealing with more than some-small-number of AppendRelInfos,
comparable to what we do with join_rel_list/join_rel_hash.

I also wonder whether it's really necessary to do a fresh search
for each individual Var, as I see is currently happening in
adjust_appendrel_attrs_mutator. At the very least we could expect
that there would be runs of requests for the same parent/child pair,
and avoid fresh list searches/hash probes when the answer is the
same as last time. But do we really have to leave that for the bottom
level of the recursion, rather than doing it once per expr at some higher
call level?

Maybe it's all right to decide that this rejiggering can be left
for v12 ... did we promise anyone that it's now sane to use thousands
of partitions?

regards, tom lane

Jonathan S. Katz

jkatz@postgresql.org

about 8 years ago

In reply to: Tom Lane (#6)

Re: Performance regression with PostgreSQL 11 and partitioning

On May 25, 2018, at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe it's all right to decide that this rejiggering can be left
for v12 ... did we promise anyone that it's now sane to use thousands
of partitions?

Per beta release, we’ve only said “improved SELECT query performance due
to enhanced partition elimination during query processing and execution as
well as parallelized partition scans” with the disclaimers of “subject to change
due to beta testing” and “please test and report back.”

In other words, no promises on the above.

However, the question is how common is the above scenario? If you’re doing
partitions by day, it would take awhile to get to 20K. But if you have something
where you have so much inbound data that you need to partition by hour, it
can occur a little bit more quickly (though will still take a couple of years, if you
were to start partitioning today).

Jonathan

Justin Pryzby

pryzby@telsasoft.com

about 8 years ago

In reply to: Tom Lane (#6)

Re: Performance regression with PostgreSQL 11 and partitioning

On Fri, May 25, 2018 at 05:17:12PM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, May 25, 2018 at 1:53 PM, Amit Langote <amitlangote09@gmail.com> wrote:

Seems here that we call find_appinfos_by_relids here for *all*
partitions, even if all but one partition may have been pruned. I
haven't studied this code in detail, but I suspect it might be
unnecessary, although I might be wrong.

Uggh. It might be possible to skip it for dummy children. That would
leave the dummy child rels generating a different pathtarget than the
non-dummy children, but I guess if we never use them again maybe it
wouldn't matter.

Maybe it's all right to decide that this rejiggering can be left
for v12 ... did we promise anyone that it's now sane to use thousands
of partitions?

https://www.postgresql.org/docs/devel/static/ddl-partitioning.html
|The following caveats apply to CONSTRAINT EXCLUSION:
|[...]
|All constraints on all partitions of the master table are examined during
|constraint exclusion, so large numbers of partitions are likely to increase
|query planning time considerably. So the legacy inheritance based partitioning
|will work well with up to perhaps a hundred partitions; DON'T TRY TO USE MANY
|THOUSANDS OF PARTITIONS.

It doesn't matter for us, as we're already using tooo many partitions, but I
would interpret that to mean that thousands of partitions are a job exclusively
for "partition pruning" of declarative partitions.

Justin

David Rowley

dgrowleyml@gmail.com

about 8 years ago

In reply to: Tom Lane (#6)

Re: Performance regression with PostgreSQL 11 and partitioning

On 26 May 2018 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm inclined to think that we should flush find_appinfos_by_relids
altogether, along with these inefficient intermediate arrays, and instead
have the relevant places in adjust_appendrel_attrs call some function
defined as "gimme the AppendRelInfo for child rel A and parent rel B,
working directly from the PlannerInfo root data". That could use a hash
lookup when dealing with more than some-small-number of AppendRelInfos,
comparable to what we do with join_rel_list/join_rel_hash.

For partitioned tables, a child just has a single parent, so to speed
up find_appinfos_by_relids we'd simply just need a faster way to find
the AppendRelInfo by child relid.

I've been working on a patch series which I plan to submit to PG12
aimed to speed up partitioning. Ideally, I'd have liked to just tag
the AppendRelInfo onto the child RelOptInfo, but that falls down
during inheritance planning.

In the end I just made an array to store AppendRelInfo's by their
child_relid which is created and populated during
setup_simple_rel_arrays.

Probably the patch could go a bit further and skip array allocation
when the append_rel_list is empty, but I've been busy working on
another bunch of stuff to improve planning time. I'd planned to give
this another look before submitting in the PG12 cycle, so state ==
WIP.

A quick test of the attached on Thomas' 4k part test, I get:

Unpatched
tps = 5.957508 (excluding connections establishing)

Patched:
tps = 15.368806 (excluding connections establishing)

Using that, if Thomas sees the same speedup then that puts PG11 at
55.4ms vs his measured 66ms in PG10.

The attached should apply with some fuzz to master.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10

David Rowley

dgrowleyml@gmail.com

about 8 years ago

In reply to: Thomas Reiss (#1)

Re: Performance regression with PostgreSQL 11 and partitioning

On 26 May 2018 at 02:30, Thomas Reiss <thomas.reiss@dalibo.com> wrote:

I spent some time to test the new features on partitioning with the
beta1. I noticed a potentially huge performance regression with
plan-time partition pruning.

Thanks for reporting. I've added this item to the PG11 open items list in:

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

I've placed it there so that it does not get forgotten about. It's
still to be decided if we can come up with something low-risk enough
that will resolve the issue.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11

Thomas Reiss

thomas.reiss@dalibo.com

almost 8 years ago

In reply to: David Rowley (#9)

Re: Performance regression with PostgreSQL 11 and partitioning

Le 30/05/2018 à 03:57, David Rowley a écrit :

On 26 May 2018 at 09:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm inclined to think that we should flush find_appinfos_by_relids
altogether, along with these inefficient intermediate arrays, and instead
have the relevant places in adjust_appendrel_attrs call some function
defined as "gimme the AppendRelInfo for child rel A and parent rel B,
working directly from the PlannerInfo root data". That could use a hash
lookup when dealing with more than some-small-number of AppendRelInfos,
comparable to what we do with join_rel_list/join_rel_hash.

For partitioned tables, a child just has a single parent, so to speed
up find_appinfos_by_relids we'd simply just need a faster way to find
the AppendRelInfo by child relid.

I've been working on a patch series which I plan to submit to PG12
aimed to speed up partitioning. Ideally, I'd have liked to just tag
the AppendRelInfo onto the child RelOptInfo, but that falls down
during inheritance planning.

In the end I just made an array to store AppendRelInfo's by their
child_relid which is created and populated during
setup_simple_rel_arrays.

Probably the patch could go a bit further and skip array allocation
when the append_rel_list is empty, but I've been busy working on
another bunch of stuff to improve planning time. I'd planned to give
this another look before submitting in the PG12 cycle, so state ==
WIP.

A quick test of the attached on Thomas' 4k part test, I get:

Unpatched
tps = 5.957508 (excluding connections establishing)

Patched:
tps = 15.368806 (excluding connections establishing)

Using that, if Thomas sees the same speedup then that puts PG11 at
55.4ms vs his measured 66ms in PG10.

The attached should apply with some fuzz to master.

Thanks for this quick patch David, unfortunately I didn't have time to
play around with it until now.

On the 4k part test, the planning time falls down to 55 ms instead of
the 4 seconds from the current head.
I also have better results with pgbench. Your patch solves this planning
time issue.

Thanks David !

#12

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

almost 8 years ago

In reply to: David Rowley (#9)

Re: Performance regression with PostgreSQL 11 and partitioning

On Wed, May 30, 2018 at 7:27 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

In the end I just made an array to store AppendRelInfo's by their
child_relid which is created and populated during
setup_simple_rel_arrays.

May be we want to change the name of the function or plural "arrays" conveys it?

Probably the patch could go a bit further and skip array allocation
when the append_rel_list is empty, but I've been busy working on
another bunch of stuff to improve planning time. I'd planned to give
this another look before submitting in the PG12 cycle, so state ==
WIP.

I was wondering if we can get rid of append_rel_list altogether. In
your patch, you have saved AppendRelInfos by child_relid. So all the
slots indexed by parent relid are empty. We could place AppendRelInfos
by parent relid. Thus a given AppendRelInfo would be places at two
places, one at the index of child relid and second at the index
pointed by parent relid. That's fine even in case of multilevel
inheritance since an intermediate parent has two relids one as a
parent and other as a child.

One problem with that we do not know how long the array would be to
start with. But we could keep on repallocing the array to increase its
size.

It then helps us in places where we iterate over append_rel_list
looking for a parent id. It does not help in places like the loop
below in inheritance_planner().

foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);

if (bms_is_member(appinfo->parent_relid, subqueryRTindexes) ||
bms_is_member(appinfo->child_relid, subqueryRTindexes) ||
bms_overlap(pull_varnos((Node *) appinfo->translated_vars),
subqueryRTindexes))
modifiableARIindexes = bms_add_member(modifiableARIindexes,
appinfo->child_relid);
}

A quick test of the attached on Thomas' 4k part test, I get:

Unpatched
tps = 5.957508 (excluding connections establishing)

Patched:
tps = 15.368806 (excluding connections establishing)

That's really good.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#13

David Rowley

dgrowleyml@gmail.com

almost 8 years ago

In reply to: Ashutosh Bapat (#12)

Re: Performance regression with PostgreSQL 11 and partitioning

On 5 June 2018 at 01:35, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I was wondering if we can get rid of append_rel_list altogether. In
your patch, you have saved AppendRelInfos by child_relid. So all the
slots indexed by parent relid are empty. We could place AppendRelInfos
by parent relid. Thus a given AppendRelInfo would be places at two
places, one at the index of child relid and second at the index
pointed by parent relid. That's fine even in case of multilevel
inheritance since an intermediate parent has two relids one as a
parent and other as a child.

One problem with that we do not know how long the array would be to
start with. But we could keep on repallocing the array to increase its
size.

To be honest, the patch was meant as a discussion topic for ways to
improve the reported regression in PG11. The patch was put together
fairly quickly.

I've not really considered what happens in the case where an inherited
table has multiple parents. The patch happens to pass regression
tests, but I've not checked to see if the existing tests create any
tables this way.

Perhaps it's okay to change things this way for just partitioned
tables and leave inheritance hierarchies alone.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#14

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

almost 8 years ago

In reply to: David Rowley (#13)

Re: Performance regression with PostgreSQL 11 and partitioning

On Tue, Jun 5, 2018 at 5:50 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 5 June 2018 at 01:35, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I was wondering if we can get rid of append_rel_list altogether. In
your patch, you have saved AppendRelInfos by child_relid. So all the
slots indexed by parent relid are empty. We could place AppendRelInfos
by parent relid. Thus a given AppendRelInfo would be places at two
places, one at the index of child relid and second at the index
pointed by parent relid. That's fine even in case of multilevel
inheritance since an intermediate parent has two relids one as a
parent and other as a child.

One problem with that we do not know how long the array would be to
start with. But we could keep on repallocing the array to increase its
size.

To be honest, the patch was meant as a discussion topic for ways to
improve the reported regression in PG11. The patch was put together
fairly quickly.

I think the idea is brilliant. I do not have objections for trying
something in that direction. I am suggesting that we take this a bit
forward and try to eliminate append_rel_list altogether.

I've not really considered what happens in the case where an inherited
table has multiple parents. The patch happens to pass regression
tests, but I've not checked to see if the existing tests create any
tables this way.

AFAIK, that case doesn't arise while processing a query. Examining the
way we create AppendRelInfos during expand_inherited_tables(), it's
clear that we create only one AppendRelInfo for a given child and also
for a given child and parent pair. If there are two tables from which
a child inherits, the child's RTE/RelOptInfo is associated only with
the top-most parent that appears in the query. See following comment
from find_all_inheritors()
/*
* Add to the queue only those children not already seen. This avoids
* making duplicate entries in case of multiple inheritance paths from
* the same parent. (It'll also keep us from getting into an infinite
* loop, though theoretically there can't be any cycles in the
* inheritance graph anyway.)
*/

Perhaps it's okay to change things this way for just partitioned
tables and leave inheritance hierarchies alone.

There's no point having two separate code paths when internally we are
treating them as same. The only difference between partitioning and
inheritance is that for multi-level partitioned table, we preserve the
inheritance hierarchy whereas for multi-level inheritance, we flatten
it.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#15

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 8 years ago

In reply to: Ashutosh Bapat (#14)

Re: Performance regression with PostgreSQL 11 and partitioning

On 2018/06/05 13:44, Ashutosh Bapat wrote:

On Tue, Jun 5, 2018 at 5:50 AM, David Rowley wrote:

On 5 June 2018 at 01:35, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I was wondering if we can get rid of append_rel_list altogether. In
your patch, you have saved AppendRelInfos by child_relid. So all the
slots indexed by parent relid are empty. We could place AppendRelInfos
by parent relid. Thus a given AppendRelInfo would be places at two
places, one at the index of child relid and second at the index
pointed by parent relid. That's fine even in case of multilevel
inheritance since an intermediate parent has two relids one as a
parent and other as a child.

One problem with that we do not know how long the array would be to
start with. But we could keep on repallocing the array to increase its
size.

To be honest, the patch was meant as a discussion topic for ways to
improve the reported regression in PG11. The patch was put together
fairly quickly.

I think the idea is brilliant. I do not have objections for trying
something in that direction. I am suggesting that we take this a bit
forward and try to eliminate append_rel_list altogether.

Imho, we should try to refrain from implementing something that needs
repalloc'ing the array, as long as we're doing it as a fix for PG 11.
Also, because getting rid of append_rel_list altogether seems a bit involved.

Let the AppendRelInfo's be added to append_rel_list when created, as is
done now (expand_inherited_tables), and transfer them to the array when
setting up various arrays (setup_simple_rel_arrays) as the David's patch does.

I've not really considered what happens in the case where an inherited
table has multiple parents. The patch happens to pass regression
tests, but I've not checked to see if the existing tests create any
tables this way.

AFAIK, that case doesn't arise while processing a query. Examining the
way we create AppendRelInfos during expand_inherited_tables(), it's
clear that we create only one AppendRelInfo for a given child and also
for a given child and parent pair. If there are two tables from which
a child inherits, the child's RTE/RelOptInfo is associated only with
the top-most parent that appears in the query. See following comment
from find_all_inheritors()
/*
* Add to the queue only those children not already seen. This avoids
* making duplicate entries in case of multiple inheritance paths from
* the same parent. (It'll also keep us from getting into an infinite
* loop, though theoretically there can't be any cycles in the
* inheritance graph anyway.)
*/

That's right.

create table p1 (a int);
create table p2 (b int);
create table c () inherits (p1, p2);

Child table 'c' has two parents but if a query specifies only p1, then
just one AppendRelInfo corresponding to p1-c pair will be created. If p2
was also specified in the query, there will be an AppendRelInfo for p2-c
pair too, but its child_relid will refer to another RT entry that's
created for 'c', that is, the one created when expanding p2.

Perhaps it's okay to change things this way for just partitioned
tables and leave inheritance hierarchies alone.

There's no point having two separate code paths when internally we are
treating them as same.

The only difference between partitioning and
inheritance is that for multi-level partitioned table, we preserve the
inheritance hierarchy whereas for multi-level inheritance, we flatten
it.

Yeah, the proposed patch doesn't change anything about how many
AppendRelInfo's are created, nor about what's contained in them, so it
seems safe to say that it would work unchanged for both regular
inheritance and partitioning.

Thanks,
Amit

#16

David Rowley

dgrowleyml@gmail.com

almost 8 years ago

In reply to: Ashutosh Bapat (#14)

Re: Performance regression with PostgreSQL 11 and partitioning

On 5 June 2018 at 16:44, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I think the idea is brilliant. I do not have objections for trying
something in that direction. I am suggesting that we take this a bit
forward and try to eliminate append_rel_list altogether.

I was trying to be realistic for something we can do to fix v11. It's
probably better to minimise the risky surgery on this code while in
beta. What I proposed was intended to fix a performance regression new
in v11. I'm not sure what you're proposing has the same intentions.

Probably, if you do want an efficient way to store the children
belonging to a parent we could just have another array of Bitmapsets
which would just serve as a set of indexes into the array I've added.

I'd prefer to get a committers thoughts on this before doing any further work.

I've attached a cleaned up patch from the last one. This just adds
some sanity checks to make sure we get an ERROR if we do ever see two
AppendRelInfos with the same child relation id. I've also made it so
the append_rel_array is only allocated when there are append rels.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

almost 8 years ago

In reply to: David Rowley (#16)

Re: Performance regression with PostgreSQL 11 and partitioning

On Wed, Jun 6, 2018 at 11:27 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

I was trying to be realistic for something we can do to fix v11. It's
probably better to minimise the risky surgery on this code while in
beta. What I proposed was intended to fix a performance regression new
in v11. I'm not sure what you're proposing has the same intentions.

Agreed that we want a less risky fix at this stage. What I am worried
is with your implementation there are two ways to get AppendRelInfo
corresponding to a child, append_rel_list and append_rel_array. Right
now we will change all the code to use append_rel_array but there is
no guarantee that some code in future will use append_rel_list. Bugs
would rise if these two get out of sync esp. considering
append_rel_list is a list which can be easily modified. I think we
should avoid that. See what we do to rt_fetch() and
planner_rt_fetch(), but we still have two ways to get an RTE. That's a
source of future bugs.

Probably, if you do want an efficient way to store the children
belonging to a parent we could just have another array of Bitmapsets
which would just serve as a set of indexes into the array I've added.

I was actually wrong when I proposed that we store AppendRelInfos
indexed by parent_id in the same array. You are right; there can be
multiple children with same parent id. I like your solution of having
an array of bitmapsets, members within which are indexes into
append_rel_array.

I'd prefer to get a committers thoughts on this before doing any further work.

Yes. I think so too.

I've attached a cleaned up patch from the last one. This just adds
some sanity checks to make sure we get an ERROR if we do ever see two
AppendRelInfos with the same child relation id. I've also made it so
the append_rel_array is only allocated when there are append rels.

In find_appinfos_by_relids(), we should Assert that the child_id in
the AppendRelInfo obtained from the array is same as its index. My
guess is that we could actually get rid of child_relid member of
AppendRelInfo altogether if we directly store AppendRelInfos in the
array without creating a list first. But that may be V12 material.

Also in the same function we want to Assert that cnt never exceeds *nappinfo.

+ Assert(root->append_rel_array);
root->append_rel_array != NULL seems to be a preferred style in the code.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#18

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: Ashutosh Bapat (#17)

Re: Performance regression with PostgreSQL 11 and partitioning

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:

On Wed, Jun 6, 2018 at 11:27 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

I was trying to be realistic for something we can do to fix v11. It's
probably better to minimise the risky surgery on this code while in
beta. What I proposed was intended to fix a performance regression new
in v11. I'm not sure what you're proposing has the same intentions.

Agreed that we want a less risky fix at this stage. What I am worried
is with your implementation there are two ways to get AppendRelInfo
corresponding to a child, append_rel_list and append_rel_array. Right
now we will change all the code to use append_rel_array but there is
no guarantee that some code in future will use append_rel_list. Bugs
would rise if these two get out of sync esp. considering
append_rel_list is a list which can be easily modified. I think we
should avoid that. See what we do to rt_fetch() and
planner_rt_fetch(), but we still have two ways to get an RTE. That's a
source of future bugs.

I'd prefer to get a committers thoughts on this before doing any further work.

Yes. I think so too.

As the original author of the append_rel_list stuff, I do have a few
thoughts here.

The reason why append_rel_list is just a list, and not any more
complicated data structure, is alluded to in the comments for
find_childrel_appendrelinfo (which David's patch should have
changed, and did not):

* This search could be eliminated by storing a link in child RelOptInfos,
* but for now it doesn't seem performance-critical. (Also, it might be
* difficult to maintain such a link during mutation of the append_rel_list.)

There are a lot of places in prepjointree.c and planner.c that whack
around the append_rel_list contents more or less extensively. It's
a lot easier in those places if the data is referenced by just one
list pointer. I do not think we want to replace append_rel_list
with something that would make those functions' lives much harder.

However, so far as I can see, the append_rel_list contents don't change
anymore once we enter query_planner(). Therefore, it's safe to build
secondary indexing structure(s) to allow fast access beyond that point.
This is not much different from what we do with, eg, the rangetable
and simple_rte_array[].

So that's basically what David's patch does, and it seems fine as far
as it goes, although I disapprove of shoving the responsibility into
setup_simple_rel_arrays() without so much as a comment change.
I'd make a separate function for that, I think. (BTW, perhaps instead
we should do what the comment quoted above contemplates, and set up a
link in the child's RelOptInfo?)

I'm still of the opinion that find_appinfos_by_relids() needs to be
nuked from orbit. It has nothing to recommend it either from the
standpoint of performance or that of intellectual coherency (or maybe
that problem is just inadequate documentation). The places consuming
its results are no better.

I was also pretty unhappy to discover, as I poked around in the code, that
recent partitioning patches have introduced various assumptions about the
ordering of the append_rel_list. It's bad enough that those exist at all;
it's worse that they're documented, if at all, only beside the code that
will fail (usually silently) if they're violated. I do not find this
acceptable. If we cannot avoid these assumptions, they need to be
documented more centrally, like in the relation.h block comment for struct
AppendRelInfo.

regards, tom lane

#19

David Rowley

dgrowleyml@gmail.com

almost 8 years ago

In reply to: Tom Lane (#18)

Re: Performance regression with PostgreSQL 11 and partitioning

On 8 June 2018 at 08:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So that's basically what David's patch does, and it seems fine as far
as it goes, although I disapprove of shoving the responsibility into
setup_simple_rel_arrays() without so much as a comment change.
I'd make a separate function for that, I think. (BTW, perhaps instead
we should do what the comment quoted above contemplates, and set up a
link in the child's RelOptInfo?)

I had originally wanted to stuff these into RelOptInfo, but I
discovered it was not really possible because of how the inheritance
planner works. It replaces the parent with the child RelOptInfo for
each child and replans the query over and over, meaning we'd never
have a complete list of AppendRelInfos to work with.

I'm still of the opinion that find_appinfos_by_relids() needs to be
nuked from orbit. It has nothing to recommend it either from the
standpoint of performance or that of intellectual coherency (or maybe
that problem is just inadequate documentation). The places consuming
its results are no better.

Yeah, I agree it's not nice that it pallocs an array then pfrees it
again. adjust_appendrel_attrs and adjust_child_relids could probably
just accept a RelIds of childrels and find the AppendRelInfos itself,
similar to how I coded find_appinfos_by_relids.

Do you see that as something for PG11? Keep in mind, I did code the
patch in a way to minimise invasiveness having considered that PG11 is
now in beta. I'm willing to write a patch that gets rid of
find_appinfos_by_relids completely. There are a few places, e.g.
create_partitionwise_grouping_paths that make use of the appinfos
multiple times, but the direct lookup is probably fast enough that
this would not matter.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#20

Tom Lane

tgl@sss.pgh.pa.us

almost 8 years ago

In reply to: David Rowley (#19)

Re: Performance regression with PostgreSQL 11 and partitioning

David Rowley <david.rowley@2ndquadrant.com> writes:

On 8 June 2018 at 08:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm still of the opinion that find_appinfos_by_relids() needs to be
nuked from orbit.

Yeah, I agree it's not nice that it pallocs an array then pfrees it
again. adjust_appendrel_attrs and adjust_child_relids could probably
just accept a RelIds of childrels and find the AppendRelInfos itself,
similar to how I coded find_appinfos_by_relids.

Why do they need to accept any additional parameters at all?

The pre-v11 incarnation of those functions took a single AppendRelInfo,
specifying an exact translation from one parent relid to one child
relid. The fundamental problem I've got with the current code, entirely
independently of any performance issues, is that it's completely unclear
-- or at least undocumented -- which translation(s) are supposed to occur.
If we assume that the code isn't 100% broken (which I'm hardly convinced
of, but we'll take it as read for the moment) then it must be that the
translations are constrained to be well-determined by the query structure
as represented by the totality of the AppendRelInfo data. So I'm thinking
maybe we don't need those parameters at all. In the pre-v11 situation,
we were saving lookups by passing the only applicable AppendRelInfo to
the translation functions. But if the translation functions have to
look up the right translation anyway, let's just make them do that,
and create whatever infrastructure we have to have to make it fast.