Partition-wise join for join between (declaratively) partitioned tables

Started by Ashutosh Bapatover 9 years ago308 messages
#1Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
3 attachment(s)

Amit Langote is working on supporting declarative partitioning in
PostgreSQL [1]. /messages/by-id/55D3093C.5010800@lab.ntt.co.jp. I have started working on supporting partition-wise join.
This mail describes very high level design and some insight into the
performance improvements.

An equi-join between two partitioned tables can be broken down into
pair-wise join between their partitions. This technique is called
partition-wise join. Partition-wise joins between similarly partitioned
tables with equi-join condition can be efficient because [2]. https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf
1. Each provably non-empty partition-wise join smaller. All such joins
collectively might be more efficient than the join between their parent.
2. Such joins are able to exploit properties of partitions like indexes,
their storage etc.
3. An N-way partition-wise join may have different efficient join orders
compared to the efficient join order between the parent tables.

A partition-wise join is processed in following stages [2]. https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf, [3]. https://users.cs.duke.edu/~shivnath/tmp/paqo_draft.pdf.
1. Applicability testing: This phase checks if the join conditions match
the partitioning scheme. A partition-wise join is efficient if there is an
equi-join on the partition keys. E.g. join between tables R and S
partitioned by columns a and b resp. can be broken down into partition-wise
joins if there exists a join condition is R.a = S.b. Or in other words the
number of provably non-empty partition-wise joins is O(N) where N is the
number of partitions.

2. Matching: This phase determines which joins between the partitions of R
and S can potentially produce tuples in the join and prunes empty joins
between partitions.

3. Clustering: This phase aims at reducing the number of partition-wise
joins by clubbing together partitions from joining relations. E.g. clubbing
multiple partitions from either of the partitioned relations which can join
to a single partition from the other partitioned relation.

4. Path/plan creation: This phase creates multiple paths for each
partition-wise join. It also creates Append path/s representing the union
of partition-wise joins.

The work here focuses on a subset of use-cases discussed in [2]. https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf. It only
considers partition-wise join for join between similarly partitioned tables
with same number of partitions with same properties, thus producing at most
as many partition-wise joins as there are partitions. It should be possible
to apply partition-wise join technique (with some special handling for
OUTER joins) if both relations have some extra partitions with
non-overlapping partition conditions, apart from the matching partitions.
But I am not planning to implement this optimization in the first cut.

The attached patch is a POC implementation of partition-wise join. It is is
based on the set of patches posted on 23rd May 2016 by Amit Langote for
declarative partitioning. The patch gives an idea about the approach used.
It has several TODOs, which I am working on.

Attached is a script with output which measures potential performance
improvement because of partition-wise join. The script uses a GUC
enable_partition_wise_join to disable/enable this feature for performance
measurement. The scripts measures performance improvement of a join between
two tables partitioned by range on integer column. Each table contains 50K
rows. Each table has an integer and a varchar column. It shows around
10-15% reduction in execution time when partition-wise join is used.
Accompanied with parallel query and FDWs, it opens up avenues for further
improvements for joins between partitioned tables.

[1]: . /messages/by-id/55D3093C.5010800@lab.ntt.co.jp
[2]: . https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf
[3]: . https://users.cs.duke.edu/~shivnath/tmp/paqo_draft.pdf

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

partitioned_join.outapplication/octet-stream; name=partitioned_join.out
partitioned_join.sqltext/x-sql; charset=US-ASCII; name=partitioned_join.sql
pg_dp_join_POC.patchtext/x-diff; charset=US-ASCII; name=pg_dp_join_POC.patch
#2Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#1)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jun 15, 2016 at 3:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Amit Langote is working on supporting declarative partitioning in PostgreSQL
[1]. I have started working on supporting partition-wise join. This mail
describes very high level design and some insight into the performance
improvements.

An equi-join between two partitioned tables can be broken down into
pair-wise join between their partitions. This technique is called
partition-wise join. Partition-wise joins between similarly partitioned
tables with equi-join condition can be efficient because [2]
1. Each provably non-empty partition-wise join smaller. All such joins
collectively might be more efficient than the join between their parent.
2. Such joins are able to exploit properties of partitions like indexes,
their storage etc.
3. An N-way partition-wise join may have different efficient join orders
compared to the efficient join order between the parent tables.

A partition-wise join is processed in following stages [2], [3].
1. Applicability testing: This phase checks if the join conditions match the
partitioning scheme. A partition-wise join is efficient if there is an
equi-join on the partition keys. E.g. join between tables R and S
partitioned by columns a and b resp. can be broken down into partition-wise
joins if there exists a join condition is R.a = S.b. Or in other words the
number of provably non-empty partition-wise joins is O(N) where N is the
number of partitions.

2. Matching: This phase determines which joins between the partitions of R
and S can potentially produce tuples in the join and prunes empty joins
between partitions.

3. Clustering: This phase aims at reducing the number of partition-wise
joins by clubbing together partitions from joining relations. E.g. clubbing
multiple partitions from either of the partitioned relations which can join
to a single partition from the other partitioned relation.

4. Path/plan creation: This phase creates multiple paths for each
partition-wise join. It also creates Append path/s representing the union of
partition-wise joins.

The work here focuses on a subset of use-cases discussed in [2]. It only
considers partition-wise join for join between similarly partitioned tables
with same number of partitions with same properties, thus producing at most
as many partition-wise joins as there are partitions. It should be possible
to apply partition-wise join technique (with some special handling for OUTER
joins) if both relations have some extra partitions with non-overlapping
partition conditions, apart from the matching partitions. But I am not
planning to implement this optimization in the first cut.

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables. Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition. If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks. Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.
And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization. Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers. If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables. That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#2)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables. Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition. If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks. Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.

Or we might be able to use indexes directly without need of a MergeAppend.

And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization. Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers. If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables. That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

+1.

The attached patch implements the logic to assess whether two partitioned
tables can be joined using partition-wise join technique described in my
last
mail on this thread.

Two partitioned relations are considered for partition-wise join if
following
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on
the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys,
paired
by the position in which they appear.

Partition-wise join technique can be applied under more lenient constraints
[1]: e.g. joins between tables with different number of partitions but having same bounds/lists for the common partitions. I am planning to defer that to a later version of this feature.
e.g. joins between tables with different number of partitions but having
same
bounds/lists for the common partitions. I am planning to defer that to a
later
version of this feature.

A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations with
the partition keys combined from the joining relations.

A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo)
structure
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition
keys
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The
logic to
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
implemented.

Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the
joins
can and can not use partition-wise join technique e.g.
"NOTICE: join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the
query.
OR
"NOTICE: join between relations (b 1) and (b 2) is NOT considered for
partition-wise join.".
These messages are for debugging only, and will be removed once path
creation
logic is implemented.

The patch adds a test partition_join.sql, which has a number of positive and
negative testcases for joins between partitioned tables.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_assess_phase.patchapplication/x-download; name=pg_dp_join_assess_phase.patch
#4Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#3)
Re: Partition-wise join for join between (declaratively) partitioned tables

Sorry forgot to mention: this patch applies on top of the v7 patches posted
by Amit Langote on 27th June (
/messages/by-id/81371428-bb4b-1e33-5ad6-8c5c51b52cb7@lab.ntt.co.jp
).

On Tue, Jul 19, 2016 at 7:41 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables. Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition. If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks. Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.

Or we might be able to use indexes directly without need of a MergeAppend.

And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.

In addition, parallel query can benefit significantly from this kind
of optimization. Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers. If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables. That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

+1.

The attached patch implements the logic to assess whether two partitioned
tables can be joined using partition-wise join technique described in my
last
mail on this thread.

Two partitioned relations are considered for partition-wise join if
following
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on
the
upper bounds.
5. There exists an equi-join condition for each pair of partition keys,
paired
by the position in which they appear.

Partition-wise join technique can be applied under more lenient
constraints [1]
e.g. joins between tables with different number of partitions but having
same
bounds/lists for the common partitions. I am planning to defer that to a
later
version of this feature.

A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations
with
the partition keys combined from the joining relations.

A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo)
structure
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the
given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition
keys
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The
logic to
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
implemented.

Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the
joins
can and can not use partition-wise join technique e.g.
"NOTICE: join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the
query.
OR
"NOTICE: join between relations (b 1) and (b 2) is NOT considered for
partition-wise join.".
These messages are for debugging only, and will be removed once path
creation
logic is implemented.

The patch adds a test partition_join.sql, which has a number of positive
and
negative testcases for joins between partitioned tables.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#5Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#4)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The
patch
is based on the declarative parition support patches provided by Amit
Langote
on 26th August 2016. The previous patch added support to assess whether two
tables can be joined using partition-wise join technique, but did not have
complete support to create plans which used partition-wise technique. This
patch implements three important pieces for supporting partition-wise join

1. Logic to assess whether join between two partitioned tables can be
executed
using partition-wise join technique.
2. Construct RelOptInfo's representating join between matching partitions of
the joining relations and add join paths to those RelOptInfo's
3. Add append paths to the RelOptInfo representing the join between
partitioned
tables. Rest of the planner code chooses the optimal path for join.

make_join_rel() now calls try_partition_wise_join(), which executes all of
the
steps listed above. If the joining partitioned relations are deemed fit for
partition-wise join, we create one RelOptInfo (if not already present)
representing a join between every pair of partitions to be joined. Since the
join between parents is deemed legal, the join between the partitions is
also
legal, hence legality of the join is not checked again. RelOptInfo
representing
the join between partitions is constructed by translating the relevant
members
of RelOptInfo of the parent join relation. Similarly SpecialJoinInfo,
restrictlist (for given join order) are constructed by translating those for
the parent join.

make_join_rel() is split into two portions, a. that deals with constructing
restrictlist and RelOptInfo for join relation b. that creates paths for the
join. The second portion is separated into a function
populate_joinrel_with_paths(), which is reused in try_partition_wise_join()
to
create paths for join between matching partitions.

set_append_rel_pathlist() generates paths for child relations, marks the
empty
children as dummy relations and creates append paths by collecting paths
with
similar properties (parameterization and pathkeys) from non-empty children.
It
then adds append paths to the parent relation. This patch divides
set_append_rel_pathlist() into two parts a. marking empty child relations as
dummy and generating paths for non-empty children. b. collecting children
paths
into append paths for parent. Part b is separate into a function
add_paths_to_append_rel() which is reused for collecting paths from
partition-wise join child relations to construct append paths for join
between
partitioned tables.

For an N-way join between partitioned tables, make_join_rel() is called as
many
times as the number of valid join orders exist. For each such call, we will
add
paths to join between partitions for corresponding join order between those
partitions. We can generate the append paths for parent joinrel only after
all
such join orders have been considered. Hence before setting cheapest path
forx
parent join relation, we set the cheapest path for each join relation
between
partitions, followed by creating append paths for the parent joinrel. This
method needs some readjustment for multi-level partitions (TODO item 2
below).

A GUC enable_partition_wise_join is added to enable or disable
partition-wise
join technique. I think the GUC is useful similar to other join related GUCs
like enable_hashjoin.

parameterized paths: While creating parameterized paths for child relations
of
a partitioned tables, we do not have an idea as to whether we will be able
to
use partition-wise join technique or not. Also we do not know the child
partition of the other partitioned table, to which a given partition would
join. Hence we do not create paths parameterized by child partitions of
other
partitioned relations. But path for child of a partitioned relation
parameterized by other parent relation, can be considered to be
parameterised
by any child relation of the other partitioned relation by replacing the
parent
parameters by corresponding child parameters. This condition is used to
eliminate parameterized paths while creating merge and hash joins, to decide
the resultant parameterization of a join between child partitions and to
create
nested loop paths with inner path parameterized by outer relation where
inner
and outer relations are child partitions. While creating such nest loop join
paths we translate the path parameterized by other parent partitioned
relation,
to that parameterized by the required child.

Functions like select_outer_pathkeys_for_merge(), make_sort_from_pathkeys(),
find_ec_member_for_tle() which did not expect to be called for a child
relation, are now used for child partition relations for joins. These
functions
are adjusted for that usage.

Testing:
I have added partition_join.sql testcase to test partition-wise join
feature.
That file has extensive tests for list, range, multi-level partitioning
schemes
and various kinds of joins including nested loop join with inner relation
parameterized by outer relationThat file has extensive tests for list,
range,
multi-level partitioning schemes and various kinds of joins including nested
loop join with inner relation parameterized by outer relation.

make check passes clean.

TODOs:

1. Instead of storing partitioning information in RelOptInfo of each of the
partitioned relations (base and join relations), we can keep a list of
canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer
to
the member of list representing the partitioning scheme of corresponding
relation. RelOptInfo's of all similarly partitioned relations get the same
pointer thus making it easy to match the partitioning schemes by comparing
the
pointers. While we are supporting only exact partition matching scheme now,
it's possible to extend this method to match compatible partitioning
schemes by
maintaining a list of compatible partitioning schemes.

Right now, I have moved some partition related structures from partition.c
to
partition.h. These structures are still being reviewed and might change when
Amit Langote improves his patches. Having canonical partitioning scheme in
PlannerInfo may not require moving those structures out. So, that code is
still
under development. A related change is renaming RangeBound structure in Amit
Langote's patches to PartitionRangeBound to avoid name conflict with
rangetypes.h. That change too should vanish once we decide where to keep
that
structure and its final name.

2. Multi-level partitioned tables: For some reason path created for joining
partitions are not being picked up as the cheapest paths. I think, we need
to
finalize the lower level paths before moving upwards in the partition
hierarchy. But I am yet to investigate the issue here.
RelOptInfo::parent_relid
should point to top parents rather than immediate parents.

3. Testing: need more tests for testing partition-wise join with foreign
tables
as partitions. More tests for parameterized joins for multi-level
partitioned
joins.

4. Remove bms_to_char(): I have added this function to print Relids in the
debugger. I have found it very useful to quickly examine Relids in debugger,
which otherwise wasn't so easy. If others find it useful too, I can create a
separate patch to be considered for a separate commit.

5. In add_paths_to_append_rel() to find the possible set of outer relations
for
generating parameterized paths for a given join. This code needs to be
adjusted
to eliminate the parent relations possible set of outer relations for a join
between child partitions.

6. Add support to reparameterize more types of paths for child relations. I
will add this once we finalize the method to reparameterize a parent path
for
child partition.

7. The patch adds make_joinrel() (name needs to be changed because of its
similariy with make_join_rel()) to construct an empty RelOptInfo for a join
between partitions. The function copies code doing the same from
build_join_rel(). build_join_rel() too can use this function, if we decide
to
retain it.

8. Few small TODOs related to code reorganization, proper function,
variable naming etc. are in the patch. pg_indent run.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join.patchtext/x-patch; charset=US-ASCII; name=pg_dp_join.patch
#6Amit Kapila
Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Bapat (#5)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

4. Remove bms_to_char(): I have added this function to print Relids in the
debugger. I have found it very useful to quickly examine Relids in debugger,
which otherwise wasn't so easy. If others find it useful too, I can create a
separate patch to be considered for a separate commit.

+1 to have such a function. I often need something like that whenever
I debug the optimizer code.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#5)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2016/09/09 18:47, Ashutosh Bapat wrote:

A related change is renaming RangeBound structure in Amit
Langote's patches to PartitionRangeBound to avoid name conflict with
rangetypes.h. That change too should vanish once we decide where to keep
that structure and its final name.

This change has been incorporated into the latest patch I posted on Sep 9 [1]/messages/by-id/28ee345c-1278-700e-39a7-36a71f9a3b43@lab.ntt.co.jp.

Thanks,
Amit

[1]: /messages/by-id/28ee345c-1278-700e-39a7-36a71f9a3b43@lab.ntt.co.jp
/messages/by-id/28ee345c-1278-700e-39a7-36a71f9a3b43@lab.ntt.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#5)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The
patch
is based on the declarative parition support patches provided by Amit
Langote
on 26th August 2016.

I have applied declarative partitioning patches posted by Amit Langote on
26 Aug 2016 and then partition-wise-join patch, getting below error while
make install.

../../../../src/include/nodes/relation.h:706: error: redefinition of
typedef ‘PartitionOptInfo’
../../../../src/include/nodes/relation.h:490: note: previous declaration of
‘PartitionOptInfo’ was here
make[4]: *** [gistbuild.o] Error 1
make[4]: Leaving directory `/home/edb/Desktop/edb_work/WO
RKDB/PG/postgresql/src/backend/access/gist'
make[3]: *** [gist-recursive] Error 2
make[3]: Leaving directory `/home/edb/Desktop/edb_work/WO
RKDB/PG/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory `/home/edb/Desktop/edb_work/WO
RKDB/PG/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory `/home/edb/Desktop/edb_work/WO
RKDB/PG/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Attached the patch for the fix of above error.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

pwj_install_fix.patchtext/x-patch; charset=US-ASCII; name=pwj_install_fix.patch
#9Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#8)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The
patch
is based on the declarative parition support patches provided by Amit
Langote
on 26th August 2016.

I have applied declarative partitioning patches posted by Amit Langote on 26
Aug 2016 and then partition-wise-join patch, getting below error while make
install.

../../../../src/include/nodes/relation.h:706: error: redefinition of typedef
‘PartitionOptInfo’
../../../../src/include/nodes/relation.h:490: note: previous declaration of
‘PartitionOptInfo’ was here
make[4]: *** [gistbuild.o] Error 1
make[4]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
make[3]: *** [gist-recursive] Error 2
make[3]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Attached the patch for the fix of above error.

Thanks for the report. I will fix this in the next patch.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#5)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

PFA patch which takes care of some of the TODOs mentioned in my
previous mail. The patch is based on the set of patches supporting
declarative partitioning by Amit Langoted posted on 26th August.

TODOs:

1. Instead of storing partitioning information in RelOptInfo of each of the
partitioned relations (base and join relations), we can keep a list of
canonical partition schemes in PlannerInfo. Every RelOptInfo gets a pointer
to
the member of list representing the partitioning scheme of corresponding
relation. RelOptInfo's of all similarly partitioned relations get the same
pointer thus making it easy to match the partitioning schemes by comparing
the
pointers. While we are supporting only exact partition matching scheme now,
it's possible to extend this method to match compatible partitioning schemes
by
maintaining a list of compatible partitioning schemes.

Right now, I have moved some partition related structures from partition.c
to
partition.h. These structures are still being reviewed and might change when
Amit Langote improves his patches. Having canonical partitioning scheme in
PlannerInfo may not require moving those structures out. So, that code is
still
under development. A related change is renaming RangeBound structure in Amit
Langote's patches to PartitionRangeBound to avoid name conflict with
rangetypes.h. That change too should vanish once we decide where to keep
that
structure and its final name.

Done.

2. Multi-level partitioned tables: For some reason path created for joining
partitions are not being picked up as the cheapest paths. I think, we need
to
finalize the lower level paths before moving upwards in the partition
hierarchy. But I am yet to investigate the issue here.
RelOptInfo::parent_relid
should point to top parents rather than immediate parents.

Done

3. Testing: need more tests for testing partition-wise join with foreign
tables
as partitions. More tests for parameterized joins for multi-level
partitioned
joins.

Needs to be done.

4. Remove bms_to_char(): I have added this function to print Relids in the
debugger. I have found it very useful to quickly examine Relids in debugger,
which otherwise wasn't so easy. If others find it useful too, I can create a
separate patch to be considered for a separate commit.

I will take care of this after rebasing the patch on the latest
sources and latest set of patches by Amit Langote.

5. In add_paths_to_append_rel() to find the possible set of outer relations
for
generating parameterized paths for a given join. This code needs to be
adjusted
to eliminate the parent relations possible set of outer relations for a join
between child partitions.

Done.

6. Add support to reparameterize more types of paths for child relations. I
will add this once we finalize the method to reparameterize a parent path
for
child partition.

Will wait for reviewer's opinion.

7. The patch adds make_joinrel() (name needs to be changed because of its
similariy with make_join_rel()) to construct an empty RelOptInfo for a join
between partitions. The function copies code doing the same from
build_join_rel(). build_join_rel() too can use this function, if we decide
to
retain it.

This will be done as a separate cleanup patch.

8. Few small TODOs related to code reorganization, proper function,
variable naming etc. are in the patch. pg_indent run.

I have taken care of most of the TODOs. But there are still some TODOs
remaining. I will take care of those in the next version of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_v2.patchinvalid/octet-stream; name=pg_dp_join_v2.patch
#11Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#10)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hi,

I got a server crash with partition_wise_join, steps to reproduce given
below.

postgres=# set enable_partition_wise_join=true;
SET
postgres=# CREATE TABLE tbl (a int,c text) PARTITION BY LIST(a);
CREATE TABLE
postgres=# CREATE TABLE tbl_p1 PARTITION OF tbl FOR VALUES IN (1, 2);
CREATE TABLE
postgres=# CREATE TABLE tbl_p2 PARTITION OF tbl FOR VALUES IN (3, 4);
CREATE TABLE
postgres=# INSERT INTO tbl VALUES (1,'P1'),(2,'P1'),(3,'P2'),(4,'P2');
INSERT 0 4
postgres=# EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM tbl t1 INNER JOIN tbl
t2 ON (t1.a = t2.a) WHERE t1.c = 'P1' AND t1.c = 'P2';
NOTICE: join between relations (b 1) and (b 2) is considered for
partition-wise join.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

#12Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#8)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hi Rajkumar,

On Fri, Sep 16, 2016 at 6:00 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Fri, Sep 9, 2016 at 3:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Hi All,

PFA the patch to support partition-wise joins for partitioned tables. The
patch
is based on the declarative parition support patches provided by Amit
Langote
on 26th August 2016.

I have applied declarative partitioning patches posted by Amit Langote on 26
Aug 2016 and then partition-wise-join patch, getting below error while make
install.

../../../../src/include/nodes/relation.h:706: error: redefinition of typedef
‘PartitionOptInfo’
../../../../src/include/nodes/relation.h:490: note: previous declaration of
‘PartitionOptInfo’ was here
make[4]: *** [gistbuild.o] Error 1
make[4]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access/gist'
make[3]: *** [gist-recursive] Error 2
make[3]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Thanks for the report and the patch.

This is fixed by the patch posted with
/messages/by-id/CAFjFpRdRFWMc4zNjeJB6p1Ncpznc9DMdXfZJmVK5X_us5zeD9Q@mail.gmail.com.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#10)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 20, 2016 at 4:26 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

PFA patch which takes care of some of the TODOs mentioned in my
previous mail. The patch is based on the set of patches supporting
declarative partitioning by Amit Langoted posted on 26th August.

I have applied declarative partitioning patches posted by Amit Langote on
26 Aug 2016 and then latest partition-wise-join patch, getting below error
while make install.

../../../../src/include/catalog/partition.h:37: error: redefinition of
typedef ‘PartitionScheme’
../../../../src/include/nodes/relation.h:492: note: previous declaration of
‘PartitionScheme’ was here
make[4]: *** [commit_ts.o] Error 1
make[4]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access/transam'
make[3]: *** [transam-recursive] Error 2
make[3]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend/access'
make[2]: *** [access-recursive] Error 2
make[2]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src/backend'
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory
`/home/edb/Desktop/edb_work/WORKDB/PG_PWJ/postgresql/src'
make: *** [all-src-recurse] Error 2

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

I have commented below statement in src/include/catalog/partition.h file
and then tried to install, it worked fine.

/* typedef struct PartitionSchemeData *PartitionScheme; */

Thanks & Regards,
Rajkumar Raghuwanshi

#14Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#13)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

../../../../src/include/catalog/partition.h:37: error: redefinition of
typedef ‘PartitionScheme’
../../../../src/include/nodes/relation.h:492: note: previous declaration of
‘PartitionScheme’ was here

[...]

PS : I am using - gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17)

Thanks for the report. For some reason, I am not getting these errors
with my compiler

[ashutosh@ubuntu regress]gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

Anyway, I have fixed it in the attached patch.

The patch is based on sources upto commit

commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed Sep 21 13:24:13 2016 +0300

and Amit Langote's set of patches posted on 15th Sept. 2016 [1]. /messages/by-id/e5c1c9cf-3f5a-c4d7-6047-7351147aaef9@lab.ntt.co.jp

There are few implementation details that need to be worked out like
1. adjust_partitionrel_attrs() calls adjust_appendrel_attrs() as many
times as the number of base relations in the join, possibly producing
a new expression tree in every call. It can be optimized to call
adjust_appendrel_attrs() only once. I will work on that if reviewers
agree that adjust_partitionrel_attrs() is needed and should be
optimized.

2. As mentioned in earlier mails, the paths parameterized by parent
partitioned table are translated to be parameterized by child
partitions. That code needs to support more kinds of paths. I will
work on that, if reviewers agree that the approach of translating
paths is acceptable.

3. Because of an issue with declarative partitioning patch [2]. /messages/by-id/CAFjFpRc=T+CjpGNkNSdOkHza8VAPb35bngaCdAzPgBkhijmJhg@mail.gmail.com
multi-level partition table tests are failing in partition_join.sql.
Those were not failing with an earlier set of patches supporting
declarative partitions. Those will be fixed based on the discussion in
that thread.

4. More tests for foreign tables as partitions and for multi-level
partitioned tables.

5. The tests use unpartitioned tables for verifying results. Those
tables and corresponding SQL statements will be removed once the tests
are finalised.

[1]: . /messages/by-id/e5c1c9cf-3f5a-c4d7-6047-7351147aaef9@lab.ntt.co.jp
[2]: . /messages/by-id/CAFjFpRc=T+CjpGNkNSdOkHza8VAPb35bngaCdAzPgBkhijmJhg@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_v3.patchinvalid/octet-stream; name=pg_dp_join_v3.patch
#15Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#14)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 22, 2016 at 4:11 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

The patch is based on sources upto commit

commit 2a7f4f76434d82eb0d1b5f4f7051043e1dd3ee1a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed Sep 21 13:24:13 2016 +0300

and Amit Langote's set of patches posted on 15th Sept. 2016 [1]

I have applied your patch on top of Amit patches posted on 15th Sept. 2016,
and tried to create some test cases on list and multi-level partition based
on test cases written for range partition.

I got some server crash and errors which I have mentioned as comment in
expected output file, which need to be updated once these issues will get
fix. also for these issue expected output is generated by creating same
query for non-partition table with same data.

Attached patch created on top to Ashutosh's patch posted on 22 Sept 2016.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

partition_join_extra_testcases.patchinvalid/octet-stream; name=partition_join_extra_testcases.patch
#16Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#14)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

[ new patch ]

This should probably get updated since Rajkumar reported a crash.
Meanwhile, here are some comments from an initial read-through:

+ * Multiple relations may be partitioned in the same way. The relations
+ * resulting from joining such relations may be partitioned in the same way as
+ * the joining relations.  Similarly, relations derived from such relations by
+ * grouping, sorting be partitioned in the same as the underlying relations.

I think you should change "may be partitioned in the same way" to "are
partitioned in the same way" or "can be regarded as partitioned in the
same way". The sentence that begins with "Similarly," is not
grammatical; it should say something like: ...by grouping or sorting
are partitioned in the same way as the underlying relations.

@@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
result->bounds->rangeinfo = rangeinfo;
break;
}
}
}

MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = result;
}

+
/*
* Are two partition bound collections logically equal?
*
* Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
* This is also useful when b1 and b2 are bound collections of two separate
* relations, respectively, because BoundCollection is a canonical
* representation of a set partition bounds (for given partitioning strategy).
*/
bool
partition_bounds_equal(PartitionKey key,

Spurious hunk.

+ * For an umpartitioned table, it returns NULL.

Spelling.

+ * two arguemnts and returns boolean. For types, it
suffices to match

Spelling.

+ * partition key expression is stored as a single member list to accomodate

Spelling.

+ * For a base relation, construct an array of partition key expressions. Each
+ * partition key expression is stored as a single member list to accomodate
+ * more partition keys when relations are joined.

How would joining relations result in more partitioning keys getting
added? Especially given the comment for the preceding function, which
says that a new PartitionScheme gets created unless an exact match is
found.

+ if (!lc)

Test lc == NIL instead of !lc.

+extern int
+PartitionSchemeGetNumParts(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->nparts : 0;
+}

I'm not convinced it's a very good idea for this function to have
special handling for when part_scheme is NULL. In
try_partition_wise_join() that checks is not needed because it's
already been done, and in generate_partition_wise_join_paths it is
needed but only because you are initializing nparts too early. If you
move this initialization down below the IS_DUMMY_REL() check you won't
need the NULL guard. I would ditch this function and let the callers
access the structure member directly.

+extern int
+PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->partnatts : 0;
+}

Similarly here. have_partkey_equi_join should probably have a
quick-exit path when part_scheme is NULL, and then num_pks can be set
afterwards unconditionally. Same for match_expr_to_partition_keys.
build_joinrel_partition_info already has it and doesn't need this
double-check.

+extern Oid *
+PartitionDescGetPartOids(PartitionDesc part_desc)
+{
+    Oid       *part_oids;
+    int        cnt_parts;
+
+    if (!part_desc || part_desc->nparts <= 0)
+        return NULL;
+
+    part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
+    for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
+        part_oids[cnt_parts] = part_desc->oids[cnt_parts];
+
+    return part_oids;
+}

I may be missing something, but this looks like a bad idea in multiple
ways. First, you've got checks for part_desc's validity here that
should be in the caller, as noted above. Second, you're copying an
array by looping instead of using memcpy(). Third, the one and only
caller is set_append_rel_size, which doesn't seem to have any need to
copy this data in the first place. If there is any possibility that
the PartitionDesc is going to change under us while that function is
running, something is deeply broken. Nothing in the planner is going
to cope with the table structure changing under us, so it had better
not.

+    /*
+     * For a partitioned relation, we will save the child RelOptInfos in parent
+     * RelOptInfo in the same the order as corresponding bounds/lists are
+     * stored in the partition scheme.
+     */

This comment seems misplaced; shouldn't it be next to the code that is
actually doing this, rather than the code that is merely setting up
for it? And, also, the comment implies that we're doing this instead
of what we'd normally do, whereas I think we are actually doing
something additional.

+        /*
+         * Save topmost parent's relid. If the parent itself is a child of some
+         * other relation, use parent's topmost parent relids.
+         */
+        if (rel->top_parent_relids)
+            childrel->top_parent_relids = rel->top_parent_relids;
+        else
+            childrel->top_parent_relids = bms_copy(rel->relids);

Comment should explain why we're doing it, not what we're doing. The
comment as written just restates what anybody who's likely to be
looking at this can already see to be true from looking at the code
that follows. The question is why do it.

+    /* Set only for "other" base or join relations. */
+    Relids        top_parent_relids;

Comment should say what it is, not just when it's set.

+    /* Should have found all the childrels of a partitioned relation. */
+    if (rel->part_scheme)
+    {
+        int        cnt_parts;
+        for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+            Assert(rel->part_rels[cnt_parts]);
+    }

A block that does nothing but Assert() should be guarded by #ifdef
USE_ASSERT_CHECKING. Although, actually, maybe this should be an
elog(), just in case?

+    }
+
+    add_paths_to_append_rel(root, rel, live_childrels);
+}
+
+static void
+add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
+                        List *live_childrels)

The new function should have a header comment, which should include an
explanation of why this is now separate from
set_append_rel_pathlist().

+ if (!live_childrels)

As before, I think live_childrels == NIL is better style.

+ generate_partition_wise_join_paths(root, rel);

Needs an update to the comment earlier in the hunk. It's important to
explain why this has to be done here and not within
join_search_one_level.

+            /* Recursively collect the paths from child joinrel. */
+            generate_partition_wise_join_paths(root, child_rel);

Given the recursion, check_stack_depth() at top of function is
probably appropriate. Same for try_partition_wise_join().

+ if (live_children)
+ pfree(live_children);

Given that none of the substructure, including ListCells, will be
freed, this seems utterly pointless. If it's necessary to recover
memory here at all, we probably need to be more aggressive about it.
Have you tested the effect of this patch on planner memory consumption
with multi-way joins between tables with many partitions? If you
haven't, you probably should. (Testing runtime would be good, too.)
Does it grow linearly? Quadratically? Exponentially? Minor leaks
don't matter, but if we're generating too much garbage we'll have to
make sure it gets cleaned up soon enough to prevent runaway memory
usage.

     /*
+     * An inner path parameterized by the parent relation of outer
+     * relation needs to be reparameterized by the outer relation to be used
+     * for parameterized nested loop join.
+     */

No doubt, but I think the comment is missing the bigger picture -- it
doesn't say anything about this being here to support partition-wise
joins, which seems like a key point.

+        /* If we could not translate the path, don't produce nest loop path. */
+        if (!inner_path)
+            return;

Why would that ever happen?

+/*
+ * If the join between the given two relations can be executed as
+ * partition-wise join create the join relations for partition-wise join,
+ * create paths for those and then create append paths to combine
+ * partition-wise join results.
+ */
+static void
+try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+                        RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
+                        List *parent_restrictlist)

This comment doesn't accurately describe what the function does. No
append paths are created here; that happens at a much later stage. I
think this comment needs quite a bit more work, and maybe the function
should be renamed, too. There are really two steps involved here:
first, we create paths for each child, attached to a new RelOptInfo
flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
paths for the parent RelOptInfo by appending a path for each child.

Broadly, I think there's a lack of adequate documentation of the
overall theory of operation of this patch. I believe that an update
to the optimizer README would be appropriate, probably with a new
section but maybe incorporating the new material into an existing
section. In addition, the comments for individual comments and chunks
of code need to do a better job explaining how each part of the patch
contributes to the overall picture. I also think we need to do a
better join hammering out the terminology. I don't particularly like
the term "partition-wise join" in the first place, although I don't
know what would be better, but we certainly need to avoid confusing a
partition-wise join -- which is a join performed by joining each
partition of one partitioned rel to the corresponding partition of a
similarly partitioned rel rather than by the usual execution strategy
of joining the parent rels -- with the concept of an other-join-rel,
which an other-member-rel analogue for joins. I don't think the patch
is currently very clear about this right now, either in the code or in
the comments. Maybe this function ought to be named something like
make_child_joins() or make_child_join_paths(), and we could use "child
joins" and/or "child join paths" as standard terminology throughout
the patch.

+    rel1_desc = makeStringInfo();
+    rel2_desc = makeStringInfo();
+
+    /* TODO: remove this notice when finalising the patch. */
+    outBitmapset(rel1_desc, rel1->relids);
+    outBitmapset(rel2_desc, rel2->relids);
+    elog(NOTICE, "join between relations %s and %s is considered for
partition-wise join.",
+         rel1_desc->data, rel2_desc->data);

Please remove your debugging cruft before submitting patches to
pgsql-hackers, or at least put #ifdef NOT_USED or something around it.

+     * We allocate the array for child RelOptInfos till we find at least one
+     * join order which can use partition-wise join technique. If no join order
+     * can use partition-wise join technique, there are no child relations.

This comment has problems. I think "till" is supposed to be "until",
and there's supposed to be a "don't" in there somewhere. But really,
I think what you're going for is just /* Allocate when first needed */
which would be a lot shorter and also more clear.

+     * Create join relations for the partition relations, if they do not exist
+     * already. Add paths to those for the given pair of joining relations.

I think the comment could be a bit more explanatory here. Something
like: "This joinrel is partitioned, so iterate over the partitions and
create paths for each one, allowing us to eventually build an
append-of-joins path for the parent. Since this routine may be called
multiple times for various join orders, the RelOptInfo needed for each
child join may or may not already exist, but the paths for this join
order definitely do not. Note that we don't create any actual
AppendPath at this stage; it only makes sense to do that at the end,
after each possible join order has been considered for each child
join. The best join order may differ from child to child."

+ * partiticipating in the given partition relations. We need them

Spelling.

+/*
+ * Construct the SpecialJoinInfo for the partition-wise join using parents'
+ * special join info. Also, instead of
+ * constructing an sjinfo everytime, we should probably save it in
+ * root->join_info_list and search within it like join_is_legal?
+ */

The lines here are of very different lengths for no particularly good
reason, and it should end with a period, not a question mark.

On the substance of the issue, it seems like the way you're doing this
right now could allocate a very large number of SpecialJoinInfo
structures. For every join relation, you'll create one
SpecialJoinInfo per legal join order per partition. That seems like
it could get to be a big number. I don't know if that's going to be a
problem from a memory-usage standpoint, but it seems like it might.
It's not just the SpecialJoinInfo itself; all of the substructure gets
duplicated, too.

+    SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
+    sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
+                                                   append_rel_infos1);

Missing a blank line here.

+ AppendRelInfo *ari = lfirst(lc);

Standard naming convention for an AppendRelInfo variable seems to be
appinfo, not ari. (I just did "git grep AppendRelInfo".)

+        /* Skip non-equi-join clauses. */
+        if (!rinfo->can_join ||
+            rinfo->hashjoinoperator == InvalidOid ||
+            !rinfo->mergeopfamilies)
+            continue;

There's definitely something ugly about this. If rinfo->can_join is
false, then we're done. But suppose one of mergeopfamilies == NIL and
rinfo->hashoperator == InvalidOid is true and the other is false. Are
we really precluded from doing a partiion-wise join in that case, or
are we just prohibited from using certain join strategies? In most
places where we make similar tests, we're careful not to require more
than we need.

I also think that these tests need to consider the partitioning
operator in use. Suppose that the partition key is of a type T which
has two operator classes X and Y. Both relations are partitioned
using an operator from opfamily X, but the join condition mentions
opfamily Y. I'm pretty sure this precludes a partitionwise join. If
the join condition used opfamily X, then we would have a guarantee
that two rows which compared as equal would be in the same partition,
but because it uses opfamily Y, that's not guaranteed. For example,
if T is a text type, X might test for exact equality using "C"
collation rules, while Y might test for equality using some
case-insensitive set of rules. If the partition boundaries are such
that "foo" and "FOO" are in different partitions, a partitionwise join
using the case-insensitive operator will produce wrong results. You
can also imagine this happening with numeric, if you have one opclass
(like the default one) that considers 5.0 and 5.00 to be equal, but
another opclass that thinks they are different; if the latter is used
to set the partition bounds, 5.0 and 5.00 could end up in different
partitions - which will be fine if an operator from that opclass is
used for the join, but not if an operator from the regular opclass is
used.

After thinking this over a bit, I think the right way to think about this is:

1. Amit's patch currently only ever uses btree opfamilies for
partitioning. It uses those for both range partitioning and list
partitioning. If we ever support hash partitioning, we would
presumably use hash opfamilies for that purpose, but right now it's
all about btree opfamilies.

2. Therefore, if A and B are partitioned but the btree opfamilies
don't match, they don't have the same partitioning scheme and this
code should never be reached. Similarly, if they use the same
opfamily but different collations, the partitioning schemes shouldn't
match and therefore this code should not be reached.

3. If A and B are partitioned and the partitioning opfamilies - which
are necessarily btree opfamilies - do match, then the operator which
appears in the query needs to be from the same opfamily and have
amopstrategy of BTEqualStrategyNumber within that opfamily. If not,
then a partition-wise join is not possible.

4. Assuming the above conditions are met, have_partkey_equi_join
doesn't need to care whether the operator chosen has mergeopfamilies
or a valid hashjoinoperator. Those factors will control which join
methods are legal, but not whether a partitionwise join is possible in
principle.

Let me know whether that seems right.

+ * RelabelType node; eval_const_expressions() will have simplied if more

Spelling.

     /*
+     * Code below scores equivalence classes by how many equivalence members
+     * can produce join clauses for this join relation. Equivalence members
+     * which do not cover the parents of a partition-wise join relation, can
+     * produce join clauses for partition-wise join relation.
+     */

I don't know what that means. The comma in the second sentence
doesn't belong there.

+    /*
+     * TODO: Instead of copying and mutating the trees one child relation at a
+     * time, we should be able to do this en-masse for all the partitions
+     * involved.
+     */

I don't see how that would be possible, but if it's a TODO, you'd
better do it (or decide not to do it and remove or change the
comment).

     /*
      * Create explicit sort nodes for the outer and inner paths if necessary.
      */
     if (best_path->outersortkeys)
     {
+        Relids        outer_relids = outer_path->parent->relids;
         Sort       *sort = make_sort_from_pathkeys(outer_plan,
-                                                   best_path->outersortkeys);
+                                                   best_path->outersortkeys,
+                                                   outer_relids);

The changes related to make_sort_from_pathkeys() are pretty opaque to
me. Can you explain?

+ * Change parameterization of sub paths recursively. Also carry out any

"sub paths" should not be two words, here or anywhere.

+reparameterize_path_for_child(PlannerInfo *root, Path *path,
+                              RelOptInfo *child_rel)

This is suspiciously unlike reparameterize_path. Why?

+    /* Computer information relevant to the foreign relations. */
+    set_foreign_rel_properties(joinrel, outer_rel, inner_rel);

Perhaps this refactoring could be split out into a preliminary patch,
which would then simplify this patch. And same for add_join_rel().

+     * Produce partition-wise joinrel's targetlist by translating the parent
+     * joinrel's targetlist. This will also include the required placeholder

Again the confusion between a "child" join and a partition-wise join...

+    /*
+     * Nothing to do if
+     * a. partition-wise join is disabled.
+     * b. joining relations are not partitioned.
+     * c. partitioning schemes do not match.
+     */
+

I don't think that's going to survive pgindent.

+ * are not considered equal, an equi-join involing inner partition keys

Spelling.

+     * Collect the partition key expressions. An OUTER join will produce rows
+     * where the partition key columns of inner side are NULL and may not fit
+     * the partitioning scheme with inner partition keys. Since two NULL values
+     * are not considered equal, an equi-join involing inner partition keys
+     * still prohibits cross-partition joins while joining with another
+     * similarly partitioned relation.

I can't figure out what this comment is trying to tell me. Possibly I
just need more caffeine.

+ * Adding these two join_rel_level list also means that top level list has more
+ * than one join relation, which is symantically incorrect.

I don't understand this, either; also, spelling.

As a general comment, the ratio of tests-to-code in this patch is way
out of line with PostgreSQL's normal practices. The total patch file
is 10965 lines. The test cases begin at line 3047, meaning that in
round figures you've got about one-quarter code and about
three-quarters test cases. I suspect that a large fraction of those
test cases aren't adding any meaningful code coverage and will just
take work to maintain. That needs to be slimmed down substantially in
any version of this considered for commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#16)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 28, 2016 at 2:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 22, 2016 at 6:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

[ new patch ]

This should probably get updated since Rajkumar reported a crash.
Meanwhile, here are some comments from an initial read-through:

Done. Fixed those crashes. Also fixed some crashes in foreign table
code and postgres_fdw. The tests were provided by Rajkumar. I am
working on including those in my patch. The attached patch is still
based on Amit's patches set of patches posted on 15th Sept. 2016. He
is addressing your comments on his patches. So, I am expecting a more
stable version arrive soon. I will rebase my patches then. Because of
a bug in those patches related to multi-level partitioned tables and
lateral joins and also a restriction on sharing partition keys across
levels of partitions, the testcase is still failing. I will work on
that while rebasing the patch.

+ * Multiple relations may be partitioned in the same way. The relations
+ * resulting from joining such relations may be partitioned in the same way as
+ * the joining relations.  Similarly, relations derived from such relations by
+ * grouping, sorting be partitioned in the same as the underlying relations.

I think you should change "may be partitioned in the same way" to "are
partitioned in the same way" or "can be regarded as partitioned in the
same way".

The relations resulting from joining partitioned relations are
partitioned in the same way, if there exist equi-join condition/s
between their partition keys. If such equi-joins do not exist, the
join is *not* partitioned. Hence I did not use "are" or "can be" which
indicate a certainty. Instead I used "may" which indicates
"uncertainty". I am not sure whether that's a good place to explain
the conditions under which such relations are partitioned. Those
conditions will change as we implement more and more partition-wise
join strategies. But that comment conveys two things 1. partition
scheme makes sense for all kinds of relations 2. multiple relations
(of any kind) may share partition scheme. I have slightly changed the
wording to make this point clear. Please let me know if it looks
better.

The sentence that begins with "Similarly," is not
grammatical; it should say something like: ...by grouping or sorting
are partitioned in the same way as the underlying relations.

Done. Instead of "are" I have used "may" for the same reason as above.

@@ -870,20 +902,21 @@ RelationBuildPartitionDesc(Relation rel)
result->bounds->rangeinfo = rangeinfo;
break;
}
}
}

MemoryContextSwitchTo(oldcxt);
rel->rd_partdesc = result;
}

+
/*
* Are two partition bound collections logically equal?
*
* Used in the keep logic of relcache.c (ie, in RelationClearRelation()).
* This is also useful when b1 and b2 are bound collections of two separate
* relations, respectively, because BoundCollection is a canonical
* representation of a set partition bounds (for given partitioning strategy).
*/
bool
partition_bounds_equal(PartitionKey key,

Spurious hunk.

Thanks. Done.

+ * For an umpartitioned table, it returns NULL.

Spelling.

Done. Thanks.

+ * two arguemnts and returns boolean. For types, it
suffices to match

Spelling.

Thanks. Done.

+ * partition key expression is stored as a single member list to accomodate

Spelling.

Thanks. Done.

+ * For a base relation, construct an array of partition key expressions. Each
+ * partition key expression is stored as a single member list to accomodate
+ * more partition keys when relations are joined.

How would joining relations result in more partitioning keys getting
added? Especially given the comment for the preceding function, which
says that a new PartitionScheme gets created unless an exact match is
found.

Let's assume that relation A and B are partitioned by columns a and b
resp. and have same partitioning scheme. This means that the datatypes
of a and b as well as the opclass used for comparing partition key
values of A and B are same. A join between A and B with condition A.a
= B.b is partitioned by both A.a and B.b. We need to keep track of
both the keys in case AB joins with C which is partitioned in the same
manner. I guess, the confusion is with the term "partition keys" -
which is being used to indicate the class of partition key as well as
instance of partition key. In the above example, the datatype of
partition key and the opclass together indicate partition key class
whereas A.a and B.b are instances of that class. Increase in partition
keys may mean both increase in the number of classes or increase in
the number of instances. In the above comment I used to mean number of
instances. May be we should use "partition key expressions" to
indicate the partition key instances and "partition key" to indicate
partition key class. I have changed the comments to use partition keys
and partition key expressions appropriately. Please let me know if the
comments are worded correctly.

PartitionScheme does not hold the actual partition key expressions. It
holds the partition key type and opclass used for comparison, which
should be same for all the relations sharing the partition scheme.

+ if (!lc)

Test lc == NIL instead of !lc.

NIL is defined as (List *) NULL and lc is ListCell *. So changed the
test to lc == NULL instead of !lc.

+extern int
+PartitionSchemeGetNumParts(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->nparts : 0;
+}

I'm not convinced it's a very good idea for this function to have
special handling for when part_scheme is NULL. In
try_partition_wise_join() that checks is not needed because it's
already been done, and in generate_partition_wise_join_paths it is
needed but only because you are initializing nparts too early. If you
move this initialization down below the IS_DUMMY_REL() check you won't
need the NULL guard. I would ditch this function and let the callers
access the structure member directly.

+extern int
+PartitionSchemeGetNumKeys(PartitionScheme part_scheme)
+{
+    return part_scheme ? part_scheme->partnatts : 0;
+}

Similarly here. have_partkey_equi_join should probably have a
quick-exit path when part_scheme is NULL, and then num_pks can be set
afterwards unconditionally. Same for match_expr_to_partition_keys.
build_joinrel_partition_info already has it and doesn't need this
double-check.

+extern Oid *
+PartitionDescGetPartOids(PartitionDesc part_desc)
+{
+    Oid       *part_oids;
+    int        cnt_parts;
+
+    if (!part_desc || part_desc->nparts <= 0)
+        return NULL;
+
+    part_oids = (Oid *) palloc(sizeof(Oid) * part_desc->nparts);
+    for (cnt_parts = 0; cnt_parts < part_desc->nparts; cnt_parts++)
+        part_oids[cnt_parts] = part_desc->oids[cnt_parts];
+
+    return part_oids;
+}

I may be missing something, but this looks like a bad idea in multiple
ways. First, you've got checks for part_desc's validity here that
should be in the caller, as noted above. Second, you're copying an
array by looping instead of using memcpy(). Third, the one and only
caller is set_append_rel_size, which doesn't seem to have any need to
copy this data in the first place. If there is any possibility that
the PartitionDesc is going to change under us while that function is
running, something is deeply broken. Nothing in the planner is going
to cope with the table structure changing under us, so it had better
not.

These three functions were written based on Amit Langote's patches
which did not expose partition related structures outside partition.c.
Hence they required wrappers. I have moved PartitionSchemeData to
partition.h and removed these functions. Instead the members are
accessed directly.

+    /*
+     * For a partitioned relation, we will save the child RelOptInfos in parent
+     * RelOptInfo in the same the order as corresponding bounds/lists are
+     * stored in the partition scheme.
+     */

This comment seems misplaced; shouldn't it be next to the code that is
actually doing this, rather than the code that is merely setting up
for it? And, also, the comment implies that we're doing this instead
of what we'd normally do, whereas I think we are actually doing
something additional.

Ok. I have moved the comment few line below, near the code which saves
the partition RelOptInfos.

+        /*
+         * Save topmost parent's relid. If the parent itself is a child of some
+         * other relation, use parent's topmost parent relids.
+         */
+        if (rel->top_parent_relids)
+            childrel->top_parent_relids = rel->top_parent_relids;
+        else
+            childrel->top_parent_relids = bms_copy(rel->relids);

Comment should explain why we're doing it, not what we're doing. The
comment as written just restates what anybody who's likely to be
looking at this can already see to be true from looking at the code
that follows. The question is why do it.

The point of that comment is to explain how it percolates down the
hierarchy, which is not so clear from the code. I have changed it to
read
/*
* Recursively save topmost parent's relid in RelOptInfos of
* partitions.
*/

Or you are expecting that the comment to explain the purpose of
top_parent_relids? I don't think that's a good idea, since the purpose
will change over the time and the comment will soon be out of sync
with the actual code, unless the developers expanding the usage
remember to update the comment. I have not seen the comments,
explaining purpose, next to the assignments. Take for example
RelOptInfo::relids.

+    /* Set only for "other" base or join relations. */
+    Relids        top_parent_relids;

Comment should say what it is, not just when it's set.

Done. Check if it looks good.

+    /* Should have found all the childrels of a partitioned relation. */
+    if (rel->part_scheme)
+    {
+        int        cnt_parts;
+        for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+            Assert(rel->part_rels[cnt_parts]);
+    }

A block that does nothing but Assert() should be guarded by #ifdef
USE_ASSERT_CHECKING. Although, actually, maybe this should be an
elog(), just in case?

Changed it to elog().

+    }
+
+    add_paths_to_append_rel(root, rel, live_childrels);
+}
+
+static void
+add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
+                        List *live_childrels)

The new function should have a header comment, which should include an
explanation of why this is now separate from
set_append_rel_pathlist().

Sorry for missing it. Added the prologue. Let me know, if it looks
good. I have made sure that all functions have a prologue and tried to
match the style with surrounding functions. Let me know if I have
still missed any or the styles do not match.

+ if (!live_childrels)

As before, I think live_childrels == NIL is better style.

Fixed.

+ generate_partition_wise_join_paths(root, rel);

Needs an update to the comment earlier in the hunk. It's important to
explain why this has to be done here and not within
join_search_one_level.

Thanks for pointing that out. Similar to generate_gather_paths(), we
need to add explanation in standard_join_search() as well as in the
function prologue. Did that. Let me know if it looks good.

+            /* Recursively collect the paths from child joinrel. */
+            generate_partition_wise_join_paths(root, child_rel);

Given the recursion, check_stack_depth() at top of function is
probably appropriate. Same for try_partition_wise_join().

Done. I wouldn't imagine a user creating that many levels of
partitions, but it's good to guard against some automated script that
has gone berserk.

+ if (live_children)
+ pfree(live_children);

Given that none of the substructure, including ListCells, will be
freed, this seems utterly pointless. If it's necessary to recover
memory here at all, we probably need to be more aggressive about it.

I intended to use list_free() instead of pfree(). Fixed that.

Have you tested the effect of this patch on planner memory consumption
with multi-way joins between tables with many partitions? If you
haven't, you probably should. (Testing runtime would be good, too.)
Does it grow linearly? Quadratically? Exponentially? Minor leaks
don't matter, but if we're generating too much garbage we'll have to
make sure it gets cleaned up soon enough to prevent runaway memory
usage.

I tried to check memory usage with various combinations of number of
partitions and number of relations being joined. For higher number of
relations being joined like 10 with 100 partitions, OOM killer kicked
in during the planning phase. I am suspecting
adjust_partitionrel_attrs() (changed that name to
adjust_join_appendrel_attrs() to be in sync with
adjust_appendrel_attrs()) to be the culprit. It copies expression
trees every time for joining two children. That's an exponentially
increasing number as the number of legal joins increases
exponentially. I am still investigating this.

As a side question, do we have a function to free an expression tree?
I didn't find any.

/*
+     * An inner path parameterized by the parent relation of outer
+     * relation needs to be reparameterized by the outer relation to be used
+     * for parameterized nested loop join.
+     */

No doubt, but I think the comment is missing the bigger picture -- it
doesn't say anything about this being here to support partition-wise
joins, which seems like a key point.

I have tried to explain the partition-wise join context. Let me know
if it looks good.

+        /* If we could not translate the path, don't produce nest loop path. */
+        if (!inner_path)
+            return;

Why would that ever happen?

Right now, reparameterize_path_for_child() does not support all kinds
of paths. So I have added that condition. I will add support for more
path types there once we agree that this is the right way to translate
the paths and that the path translation is required.

+/*
+ * If the join between the given two relations can be executed as
+ * partition-wise join create the join relations for partition-wise join,
+ * create paths for those and then create append paths to combine
+ * partition-wise join results.
+ */
+static void
+try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
+                        RelOptInfo *joinrel, SpecialJoinInfo *parent_sjinfo,
+                        List *parent_restrictlist)

This comment doesn't accurately describe what the function does. No
append paths are created here; that happens at a much later stage.

Removed reference to the append paths. Sorry for leaving it there,
when I moved the append path creation to a later stage.

I
think this comment needs quite a bit more work, and maybe the function
should be renamed, too.

Improved the comments in the prologue and inside the function. Please
let me know, if they look good.

There are really two steps involved here:
first, we create paths for each child, attached to a new RelOptInfo
flagged as RELOPT_OTHER_JOINREL paths; later, we create additional
paths for the parent RelOptInfo by appending a path for each child.

Right, the first one is done in try_partition_wise_join() and the
later is done in generate_partition_wise_join_paths()

Broadly, I think there's a lack of adequate documentation of the
overall theory of operation of this patch. I believe that an update
to the optimizer README would be appropriate, probably with a new
section but maybe incorporating the new material into an existing
section.

Done. I have added a separate section to optimizer/README

In addition, the comments for individual comments and chunks
of code need to do a better job explaining how each part of the patch
contributes to the overall picture.

I also think we need to do a
better join hammering out the terminology. I don't particularly like
the term "partition-wise join" in the first place, although I don't
know what would be better, but we certainly need to avoid confusing a
partition-wise join -- which is a join performed by joining each
partition of one partitioned rel to the corresponding partition of a
similarly partitioned rel rather than by the usual execution strategy
of joining the parent rels -- with the concept of an other-join-rel,
which an other-member-rel analogue for joins. I don't think the patch
is currently very clear about this right now, either in the code or in
the comments. Maybe this function ought to be named something like
make_child_joins() or make_child_join_paths(), and we could use "child
joins" and/or "child join paths" as standard terminology throughout
the patch.

Partition-wise join is widely used term in the literature. Other
DBMSes use the same term as well. So, I think we should stick with
"partition-wise join". Partition-wise join as you have described is a
join performed by joining each partition of one partitioned rel to the
corresponding partition of a similarly partitioned rel rather than by
the usual execution strategy of joining the parent rels. I have
usually used the term "partition-wise join technique" to refer to this
method. I have changed the other usages of this term to use wording
like child joins or join between partiitions or join between child
relations as appropriate. Also, I have changed the names of functions
dealing with joins between partitions to use child_join instead of
partition_join or partition_wise_join.

Since partition-wise join is a method to join two relations just like
other methods, try_partition_wise_join() fits into the naming
convention try_<join technique> like try_nestloop_join.

+    rel1_desc = makeStringInfo();
+    rel2_desc = makeStringInfo();
+
+    /* TODO: remove this notice when finalising the patch. */
+    outBitmapset(rel1_desc, rel1->relids);
+    outBitmapset(rel2_desc, rel2->relids);
+    elog(NOTICE, "join between relations %s and %s is considered for
partition-wise join.",
+         rel1_desc->data, rel2_desc->data);

Please remove your debugging cruft before submitting patches to
pgsql-hackers, or at least put #ifdef NOT_USED or something around it.

I kept this one intentionally. But as the TODO comment says, I do
intend to remove it once testing is over. Those messages make it very
easy to know whether partition-wise join was considered for a given
join or not. Without those messages, one has to break into
try_partition_wise_join() to figure out whether partition-wise join
was used or not. The final plan may not come out to be partition-wise
join plan even if partition-wise join was considered. Although, I have
now used DEBUG3 instead of NOTICE and removed those lines from the
expected output.

+     * We allocate the array for child RelOptInfos till we find at least one
+     * join order which can use partition-wise join technique. If no join order
+     * can use partition-wise join technique, there are no child relations.

This comment has problems. I think "till" is supposed to be "until",
and there's supposed to be a "don't" in there somewhere. But really,
I think what you're going for is just /* Allocate when first needed */
which would be a lot shorter and also more clear.

Sorry for those mistakes. Yes, shorter version is better. Fixed the
comment as per your suggestion.

+     * Create join relations for the partition relations, if they do not exist
+     * already. Add paths to those for the given pair of joining relations.

I think the comment could be a bit more explanatory here. Something
like: "This joinrel is partitioned, so iterate over the partitions and
create paths for each one, allowing us to eventually build an
append-of-joins path for the parent. Since this routine may be called
multiple times for various join orders, the RelOptInfo needed for each
child join may or may not already exist, but the paths for this join
order definitely do not. Note that we don't create any actual
AppendPath at this stage; it only makes sense to do that at the end,
after each possible join order has been considered for each child
join. The best join order may differ from child to child."

Copied verbatim. Thanks for the detailed comment.

+ * partiticipating in the given partition relations. We need them

Spelling.

Done. Also fixed other grammatical mistakes and typos in that comment.

+/*
+ * Construct the SpecialJoinInfo for the partition-wise join using parents'
+ * special join info. Also, instead of
+ * constructing an sjinfo everytime, we should probably save it in
+ * root->join_info_list and search within it like join_is_legal?
+ */

The lines here are of very different lengths for no particularly good
reason, and it should end with a period, not a question mark.

My bad. Sorry. Fixed.

On the substance of the issue, it seems like the way you're doing this
right now could allocate a very large number of SpecialJoinInfo
structures. For every join relation, you'll create one
SpecialJoinInfo per legal join order per partition. That seems like
it could get to be a big number. I don't know if that's going to be a
problem from a memory-usage standpoint, but it seems like it might.
It's not just the SpecialJoinInfo itself; all of the substructure gets
duplicated, too.

Yes. We need the SpecialJoinInfo structures for the existing path
creation to work. The code will be complicated if we try to use parent
SpecialJoinInfo instead of creating those for children. We may free
memory allocated in SpecialJoinInfo to save some memory.
SpecialJoinInfos are not needed once the paths are created. Still we
will waste some memory for semi_rhs_exprs, which are reused for unique
paths. But otherwise we will reclaim the rest of the memory. Memory
wastage in adjust_partition_relids() may be minimized by modifying
adjust_appendrel_attrs() to accept list of AppendRelInfos and mutating
the tree only once rather than doing it N times for an N-way join.

+    SpecialJoinInfo *sjinfo = copyObject(parent_sjinfo);
+    sjinfo->min_lefthand = adjust_partition_relids(sjinfo->min_lefthand,
+                                                   append_rel_infos1);

Missing a blank line here.

Done.

+ AppendRelInfo *ari = lfirst(lc);

Standard naming convention for an AppendRelInfo variable seems to be
appinfo, not ari. (I just did "git grep AppendRelInfo".)

Done.

+        /* Skip non-equi-join clauses. */
+        if (!rinfo->can_join ||
+            rinfo->hashjoinoperator == InvalidOid ||
+            !rinfo->mergeopfamilies)
+            continue;

There's definitely something ugly about this. If rinfo->can_join is
false, then we're done. But suppose one of mergeopfamilies == NIL and
rinfo->hashoperator == InvalidOid is true and the other is false. Are
we really precluded from doing a partiion-wise join in that case, or
are we just prohibited from using certain join strategies? In most
places where we make similar tests, we're careful not to require more
than we need.

Right. That condition is flawed. Corrected it.

I also think that these tests need to consider the partitioning
operator in use. Suppose that the partition key is of a type T which
has two operator classes X and Y. Both relations are partitioned
using an operator from opfamily X, but the join condition mentions
opfamily Y. I'm pretty sure this precludes a partitionwise join. If
the join condition used opfamily X, then we would have a guarantee
that two rows which compared as equal would be in the same partition,
but because it uses opfamily Y, that's not guaranteed. For example,
if T is a text type, X might test for exact equality using "C"
collation rules, while Y might test for equality using some
case-insensitive set of rules. If the partition boundaries are such
that "foo" and "FOO" are in different partitions, a partitionwise join
using the case-insensitive operator will produce wrong results. You
can also imagine this happening with numeric, if you have one opclass
(like the default one) that considers 5.0 and 5.00 to be equal, but
another opclass that thinks they are different; if the latter is used
to set the partition bounds, 5.0 and 5.00 could end up in different
partitions - which will be fine if an operator from that opclass is
used for the join, but not if an operator from the regular opclass is
used.

Your description above uses opfamily and opclass interchangeably. It
starts saying X and Y are classed but then also refers to them as
families. But I got the point. I guess, similar to
relation_has_unique_index_for(), I have to check whether the operator
family specified in the partition scheme is present in the
mergeopfamilies in RestrictInfo for matching partition key. I have
added that check and restructured that portion of code to be readable.

After thinking this over a bit, I think the right way to think about this is:

1. Amit's patch currently only ever uses btree opfamilies for
partitioning. It uses those for both range partitioning and list
partitioning. If we ever support hash partitioning, we would
presumably use hash opfamilies for that purpose, but right now it's
all about btree opfamilies.

2. Therefore, if A and B are partitioned but the btree opfamilies
don't match, they don't have the same partitioning scheme and this
code should never be reached. Similarly, if they use the same
opfamily but different collations, the partitioning schemes shouldn't
match and therefore this code should not be reached.

That's right.

3. If A and B are partitioned and the partitioning opfamilies - which
are necessarily btree opfamilies - do match, then the operator which
appears in the query needs to be from the same opfamily and have
amopstrategy of BTEqualStrategyNumber within that opfamily. If not,
then a partition-wise join is not possible.

I think this is achieved by checking whether the opfamily for given
partition key is present in the mergeopfamilies of corresponding
RestrictInfo, as stated above.

4. Assuming the above conditions are met, have_partkey_equi_join
doesn't need to care whether the operator chosen has mergeopfamilies
or a valid hashjoinoperator. Those factors will control which join
methods are legal, but not whether a partitionwise join is possible in
principle.

If mergeopfamilies is NIL, above check will fail anyway. But skipping
a clause which has mergeopfamilies NIL will save some cycles in
matching expressions.

There is something strange happening with Amit's patch. When we create
a table partitioned by range on a column of type int2vector, it
somehow gets a btree operator family, but doesn't have mergeopfamilies
set in RestrictInfo of equality condition on that column. Instead the
RestrictInfo has hashjoinoperator. In this case if we ignore
hashjoinoperator, we won't be able to apply partition-wise join. I
guess, in such case we want to play safe and not apply partition-wise
join, even though applying it will give the correct result.

+ * RelabelType node; eval_const_expressions() will have simplied if more

Spelling.

Thanks. Done.

/*
+     * Code below scores equivalence classes by how many equivalence members
+     * can produce join clauses for this join relation. Equivalence members
+     * which do not cover the parents of a partition-wise join relation, can
+     * produce join clauses for partition-wise join relation.
+     */

I don't know what that means. The comma in the second sentence
doesn't belong there.

Sorry for that construction. I have changed the comment to be
something more meaningful.

+    /*
+     * TODO: Instead of copying and mutating the trees one child relation at a
+     * time, we should be able to do this en-masse for all the partitions
+     * involved.
+     */

I don't see how that would be possible, but if it's a TODO, you'd
better do it (or decide not to do it and remove or change the
comment).

That should be doable by passing a list of AppendRelInfo structures to
adjust_appendrel_attrs_mutator(). In the mutator, we have to check
each appinfo instead of just one. But that's a lot of refactoring. May
be done as a separate patch, if we are consuming too much memory. I
have removed TODO for now.

/*
* Create explicit sort nodes for the outer and inner paths if necessary.
*/
if (best_path->outersortkeys)
{
+        Relids        outer_relids = outer_path->parent->relids;
Sort       *sort = make_sort_from_pathkeys(outer_plan,
-                                                   best_path->outersortkeys);
+                                                   best_path->outersortkeys,
+                                                   outer_relids);

The changes related to make_sort_from_pathkeys() are pretty opaque to
me. Can you explain?

prepare_sort_from_pathkeys() accepts Relids as one of the argument to
find equivalence members belonging to child relations. The function
does not expect relids when searching equivalence members for parent
relations. Before this patch, make_sort_from_pathkeys() passed NULL to
this function, because it didn't expect child relations before.
Because of partition-wise joins, we need to sort child relations for
merge join or to create unique paths. So, make_sort_from_pathkeys() is
required to pass relids to prepare_sort_from_pathkeys() when
processing child relations, so that the later does not skip child
members.

+ * Change parameterization of sub paths recursively. Also carry out any

"sub paths" should not be two words, here or anywhere.

Fixed.

+reparameterize_path_for_child(PlannerInfo *root, Path *path,
+                              RelOptInfo *child_rel)

This is suspiciously unlike reparameterize_path. Why?

reparameterize_path() tries to create path with new parameterization
from an existing parameterized path. So, it looks for additional
conditions to expand the parameterization. But this functions
translates a path parameterized by parent to be parameterized by its
child. That does not involve looking for any extra conditions, but
involves translating the existing ones so that they can be used with a
child. A right name would be translate_parampath_to_child() or
something which uses word "translate" instead of "reparameterize". But
every name like that is getting too long. For now I have renamed it as
reparameterize_path_by_child(). Also added a comment in the function
prologue about cost, rows, width etc.

+    /* Computer information relevant to the foreign relations. */
+    set_foreign_rel_properties(joinrel, outer_rel, inner_rel);

Perhaps this refactoring could be split out into a preliminary patch,
which would then simplify this patch. And same for add_join_rel().

Yes, that's better. I will separate the code out in a separate patch.

There's code in build_join_rel() and build_partition_join_rel() (I
will change that name) which creates a joinrel RelOptInfo. Most of
that code simply sets NULL or 0 fields and is duplicated in both the
functions. Do you see any value in separating it out in its own
function?

Also, makeNode() uses palloc0(), thus makeNode(RelOptInfo) would set
most of the fields to 0 or NULL. Why do we then again set those fields
as NULL or 0? Should I try to remove unnecessary assignments?

+     * Produce partition-wise joinrel's targetlist by translating the parent
+     * joinrel's targetlist. This will also include the required placeholder

Again the confusion between a "child" join and a partition-wise join...

+    /*
+     * Nothing to do if
+     * a. partition-wise join is disabled.
+     * b. joining relations are not partitioned.
+     * c. partitioning schemes do not match.
+     */
+

I don't think that's going to survive pgindent.

Changed this code a bit.

+ * are not considered equal, an equi-join involing inner partition keys

Spelling.

+     * Collect the partition key expressions. An OUTER join will produce rows
+     * where the partition key columns of inner side are NULL and may not fit
+     * the partitioning scheme with inner partition keys. Since two NULL values
+     * are not considered equal, an equi-join involing inner partition keys
+     * still prohibits cross-partition joins while joining with another
+     * similarly partitioned relation.

I can't figure out what this comment is trying to tell me. Possibly I
just need more caffeine.

Re-wrote the comment with examples and detailed explanation. The
comment talks about whether inner partition key expressions should be
considered as the partition key expressions for the join, given that
for an OUTER join the inner partition key expressions may go NULL. The
comment explains why it's safe to do so. If we don't do that, any FULL
OUTER join will have no partition expressions and thus partition-wise
join technique will be useless for a N-way FULL OUTER join even if
it's safe to use it.

+ * Adding these two join_rel_level list also means that top level list has more
+ * than one join relation, which is symantically incorrect.

I don't understand this, either; also, spelling.

I think, that sentence is not required. Removed it.

As a general comment, the ratio of tests-to-code in this patch is way
out of line with PostgreSQL's normal practices. The total patch file
is 10965 lines. The test cases begin at line 3047, meaning that in
round figures you've got about one-quarter code and about
three-quarters test cases. I suspect that a large fraction of those
test cases aren't adding any meaningful code coverage and will just
take work to maintain. That needs to be slimmed down substantially in
any version of this considered for commit.

I agree. We require two kinds of tests 1. those which test partition
scheme matching 2. those test the planner code, which deals with path
creation. I have added both kinds of testcases for all kinds of
partitioning schemes (range, list, multi-level, partition key being
expressions, columns). That's not required. We need 1st kind of tests
for all partitioning schemes and 2nd kind of testcases only for one of
the partitioning schemes. So, definitely the number of tests will
reduce. A possible extreme would be to use a single multi-level
partitioned tests, which includes all kinds of partitioning schemes at
various partition levels. But that kind of testcase will be highly
unreadable and harder to maintain. Let me know what do you think. I
will work on that in the next version of patch. The test still fails
because of a bug in Amit's earlier set of patches

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_v4.patchtext/plain; charset=US-ASCII; name=pg_dp_join_v4.patch
#18Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#17)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Have you tested the effect of this patch on planner memory consumption
with multi-way joins between tables with many partitions? If you
haven't, you probably should. (Testing runtime would be good, too.)
Does it grow linearly? Quadratically? Exponentially? Minor leaks
don't matter, but if we're generating too much garbage we'll have to
make sure it gets cleaned up soon enough to prevent runaway memory
usage.

I tried to check memory usage with various combinations of number of
partitions and number of relations being joined. For higher number of
relations being joined like 10 with 100 partitions, OOM killer kicked
in during the planning phase. I am suspecting
adjust_partitionrel_attrs() (changed that name to
adjust_join_appendrel_attrs() to be in sync with
adjust_appendrel_attrs()) to be the culprit. It copies expression
trees every time for joining two children. That's an exponentially
increasing number as the number of legal joins increases
exponentially. I am still investigating this.

I think the root of this problem is that the existing paths shares a
lot more substructure than the ones created by the new code. Without
a partition-wise join, the incremental memory usage for a joinrel
isn't any different whether the underlying rel is partitioned or not.
If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
be pointing to some kind of Scan. But the join itself creates exactly
the same amount of new stuff regardless of what's underneath it. With
partitionwise join, that ceases to be true. Every joinrel - and the
number of those grows exponentially in the number of baserels, IICU -
needs its own list of paths for every member rel. So if a
non-partition-wise join created X paths, and there are K partitions, a
partition-wise join creates X * K paths. That's a lot.

Although we might be able to save some memory by tightening things up
here and there - for example, right now the planner isn't real smart
about recycling paths that are evicted by add_path(), and there's
probably other wastage as well - I suspect that what this shows is
that the basic design of this patch is not going to be viable.
Intuitively, it's often going to be the case that we want the "same
plan" for every partition-set. That is, if we have A JOIN B ON A.x =
B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility
partitioned, then the result should be an Append plan with 100 join
plans under it, and all 100 of those plans should be basically mirror
images of each other. Of course, that's not really right in general:
for example, it could be that A1 is big and A2 is small while B1 is
small and B2 is big, so that the right plan for (A1 JOIN B1) and for
(A2 JOIN B2) are totally different from each other. But in many
practical cases we'll want to end up with a plan of precisely the same
shape for all children, and the current design ignores this, expending
both memory and CPU time to compute essentially-equivalent paths
across all children.

One way of attacking this problem is to gang together partitions which
are equivalent for planning purposes, as discussed in the paper "Join
Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
and Babu. However, it's not exactly clear how to do this: we could
gang together partitions that have the same index definitions, but the
sizes of the heaps, the sizes of their indexes, and the row counts
will vary from one partition to the next, and any of those things
could cause the plan choice to be different for one partition vs. the
next. We could try to come up with heuristics for when those things
are likely to be true. For example, suppose we compute the set of
partitions such that all joined relations have matching index
definitions on all tables; then, we take the biggest table in the set
and consider all tables more than half that size as part of one gang.
The biggest table becomes the leader and we compute partition-wise
paths for just that partition; the other members of the gang will
eventually get a plan that is of the same shape, but we don't actually
create it that plan until after scan/join planning is concluded.

Another idea is to try to reduce peak memory usage by performing
planning separately for each partition-set. For example, suppose we
decide to do a partition-wise join of A, B, and C. Initially, this
gets represented as a PartitionJoinPath tree, like this:

PartitionJoinPath
-> AppendPath for A
-> PartitionJoinPath
-> AppendPath for B
-> AppendPath for C

Because we haven't created individual join paths for the members, this
doesn't use much memory. Somehow, we come up with a cost for the
PartitionJoinPath; it probably won't be entirely accurate. Once
scan/join planning is concluded, if our final path contains a
PartitionJoinPath, we go back and loop over the partitions. For each
partition, we switch to a new memory context, perform planning, copy
the best path and its substructure back to the parent context, and
then reset the context. In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path
for a single partition.

Maybe there are other ideas but I have a feeling any way you slice it
this is going to be a lot of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#18)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Oct 18, 2016 at 9:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Have you tested the effect of this patch on planner memory consumption
with multi-way joins between tables with many partitions? If you
haven't, you probably should. (Testing runtime would be good, too.)
Does it grow linearly? Quadratically? Exponentially? Minor leaks
don't matter, but if we're generating too much garbage we'll have to
make sure it gets cleaned up soon enough to prevent runaway memory
usage.

I tried to check memory usage with various combinations of number of
partitions and number of relations being joined. For higher number of
relations being joined like 10 with 100 partitions, OOM killer kicked
in during the planning phase. I am suspecting
adjust_partitionrel_attrs() (changed that name to
adjust_join_appendrel_attrs() to be in sync with
adjust_appendrel_attrs()) to be the culprit. It copies expression
trees every time for joining two children. That's an exponentially
increasing number as the number of legal joins increases
exponentially. I am still investigating this.

I think the root of this problem is that the existing paths shares a
lot more substructure than the ones created by the new code. Without
a partition-wise join, the incremental memory usage for a joinrel
isn't any different whether the underlying rel is partitioned or not.
If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
be pointing to some kind of Scan. But the join itself creates exactly
the same amount of new stuff regardless of what's underneath it. With
partitionwise join, that ceases to be true. Every joinrel - and the
number of those grows exponentially in the number of baserels, IICU -
needs its own list of paths for every member rel. So if a
non-partition-wise join created X paths, and there are K partitions, a
partition-wise join creates X * K paths. That's a lot.

Although we might be able to save some memory by tightening things up
here and there - for example, right now the planner isn't real smart
about recycling paths that are evicted by add_path(), and there's
probably other wastage as well - I suspect that what this shows is
that the basic design of this patch is not going to be viable.
Intuitively, it's often going to be the case that we want the "same
plan" for every partition-set. That is, if we have A JOIN B ON A.x =
B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility
partitioned, then the result should be an Append plan with 100 join
plans under it, and all 100 of those plans should be basically mirror
images of each other. Of course, that's not really right in general:
for example, it could be that A1 is big and A2 is small while B1 is
small and B2 is big, so that the right plan for (A1 JOIN B1) and for
(A2 JOIN B2) are totally different from each other. But in many
practical cases we'll want to end up with a plan of precisely the same
shape for all children, and the current design ignores this, expending
both memory and CPU time to compute essentially-equivalent paths
across all children.

I think there are going to be two kinds of partitioning use-cases.
First, carefully hand-crafted by DBAs so that every partition is
different from other and so is every join between two partitions.
There will be lesser number of partitions, but creating paths for each
join between partitions will be crucial from performance point of
view. Consider, for example, systems which use partitions to
consolidate results from different sources for analytical purposes or
sharding. If we consider various points you have listed in [1]http://postgresql.nabble.com/design-for-a-partitioning-feature-was-inheritance-td5921603.html as to
why a partition is equivalent to a table, each join between partitions
is going to have very different characteristics and thus deserves a
set of paths for its own. Add to that possibility of partition pruning
or certain conditions affecting particular partitions, the need for
detailed planning evident.

The other usage of partitioning is to distribute the data and/or
quickly eliminate the data by partition pruning. In such case, all
partitions of a given table will have very similar properties. There
is a large chance that we will end up having same plans for every
partition and for joins between partitions. In such cases, I think it
suffices to create paths for just one or may be a handful partitions
of join and repeat that plan for other partitions of join. But in such
cases it also makes sense to have a light-weight representation for
partitions as compared to partitions being a full-fledged tables. If
we have such a light-weight representation, we may not even create
RelOptInfos representing joins between partitions, and different paths
for each join between partitions.

One way of attacking this problem is to gang together partitions which
are equivalent for planning purposes, as discussed in the paper "Join
Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
and Babu. However, it's not exactly clear how to do this: we could
gang together partitions that have the same index definitions, but the
sizes of the heaps, the sizes of their indexes, and the row counts
will vary from one partition to the next, and any of those things
could cause the plan choice to be different for one partition vs. the
next. We could try to come up with heuristics for when those things
are likely to be true. For example, suppose we compute the set of
partitions such that all joined relations have matching index
definitions on all tables; then, we take the biggest table in the set
and consider all tables more than half that size as part of one gang.
The biggest table becomes the leader and we compute partition-wise
paths for just that partition; the other members of the gang will
eventually get a plan that is of the same shape, but we don't actually
create it that plan until after scan/join planning is concluded.

Section 5 of that paper talks about clustering partitions together for
joining, only when there is 1:m OR n:1 partition matching for join. In
such a case, it clusters all the partitions from one relation that are
all joined with a single partition of the other relation. But I think
your idea to gang up partitions with similar properties may reduce the
number of paths we create but as you have mentioned how to gang them
up is not very clear. There are just too many factors like
availability of the indexes, sizes of tables, size of intermediate
results etc. which make it difficult to identify the properties used
for ganging up. Even after we do that, in the worst case, we will
still end up creating paths for all partitions of all joins, thus
causing increase in paths proportionate to the number of partitions.

In the section 6.3, the paper mentions that the number of paths
retained are linear in the number of child joins per parent join. So,
it's clear that the paper never considered linear increase in the
paths to be a problem or at least a problem that that work had to
solve. Now, it's surprising that their memory usage increased by 7% to
10%. But 1. they might be measuring total memory and not the memory
used by the planner and they experimented with PostgreSQL 8.3.7, which
probably tried much less number of paths than the current optimizer.

Another idea is to try to reduce peak memory usage by performing
planning separately for each partition-set. For example, suppose we
decide to do a partition-wise join of A, B, and C. Initially, this
gets represented as a PartitionJoinPath tree, like this:

PartitionJoinPath
-> AppendPath for A
-> PartitionJoinPath
-> AppendPath for B
-> AppendPath for C

Because we haven't created individual join paths for the members, this
doesn't use much memory. Somehow, we come up with a cost for the
PartitionJoinPath; it probably won't be entirely accurate. Once
scan/join planning is concluded, if our final path contains a
PartitionJoinPath, we go back and loop over the partitions.

A typical join tree will be composite: some portion partitioned and
some portion unpartitioned or different portions partitioned by
different partition schemes. In such case, inaccurate costs for
PartitionJoinPath, can affect the plan heavily, causing a suboptimal
path to be picked. Assuming that partitioning will be useful for large
sets of data, choosing a suboptimal plan can be more dangerous than
consuming memory for creating paths.

If we could come up with costs for PartitionJoinPath using some
methods of interpolation, say by sampling few partitions and then
extrapolating their costs for entire PartitionJoinPath, we can use
this method. But unless the partitions have very similar
characteristics or have such characteristics that costs can be guessed
based on the differences between the characteristics, I do not see how
that can happen. For example, while costing a PartitionJoinPath with
pathkeys, the costs will change a lot based on whether underlying
relations have indexes, or which join methods are used, which in turn
is based on properties on the partitions. Same is the case for paths
with parameterization. All such paths are important when a partitioned
join relation joins with other unpartitioned relation or a partitioned
relation with different partitioning scheme.

When each partition of base relation being joined has different
properties, the cost for join between one set of partitions can differ
from join between other set of partitions. Not only that, the costs
for various properties of resultant paths like pathkeys,
parameterization can vary a lot, depending upon the available indexes
and estimates of rows for each join. So, we need to come up with these
cost estimates separately for each join between partitions to come up
with cost of each PartitionJoinPath. If we have to calculate those
costs to create PartitionJoinPath, we better save them in paths rather
than recalculating them in the second round of planning for joins
between partitions.

For each
partition, we switch to a new memory context, perform planning, copy
the best path and its substructure back to the parent context, and
then reset the context.

This could be rather tricky. It assumes that all the code that creates
paths for joins, should not allocate any memory which is linked to
some object in a context that lives longer than the path creation
context. There is some code like create_join_clause() or
make_canonical_pathkey(), which carefully chooses which memory context
to allocate memory in. But can we ensure it always? postgres_fdw for
example allocates memory for PgFdwRelationInfo in current memory
context and attaches it in RelOptInfo, which should be in the
planner's original context. So, if we create a new memory context for
each partition, fpinfos would be invalidated when those contexts are
released. Not that, we can not enforce some restriction on the memory
usage while planning, it's hard to enforce it and bugs arising from it
may go unnoticed. GEQO planner might have its own problems with this
approach. Third party FDWs will pose a problem.

A possible solution would be to keep the track of used paths using a
reference count. Once the paths for given join tree are created, free
up the unused paths by traversing pathlist in each of the RelOptInfos.
Attached patch has a prototype implementation for the same. There are
some paths which are not linked to RelOptInfos, which need a bit
different treatment, but they can be handled too.

In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path
for a single partition.

Maybe there are other ideas but I have a feeling any way you slice it
this is going to be a lot of work.

For the case of carefully hand-crafted partitions, I think, users
would expect the planner to use really the best plan and thus may be
willing to accommodate for increased memory usage. Going by any
approach that does not create the paths for joins between partitions
is not guaranteed to give the best plan. Users willing to provide
increased memory will be unhappy if we do not give them the best path.

The user who creates hundreds of partitions, will ideally be using
pretty powerful servers with a lot of memory. On such servers, the
linear increase in memory for paths may not be as bad as you are
portraying above, as long as its producing the best plan.

Just joining partitioned tables with hundreds of partitions does not
increase the number of paths. Number of paths increases when two
partitioned tables with similar partitioning scheme are joined with
equality condition on partition key. Unless we consider
repartitioning, how many of the joining relations share same
partitioning scheme? Section 8.6 mentions, "no TPC-H query plan,
regardless of the partitioning scheme, contains n-way child joins for
n >= 4. Maximum partitions that the paper mentions is 168 (Table 3).
My VM which has 8GB RAM and 4 cores handled that case pretty well. We
may add logic to free up space used by useless paths post-join to free
up some memory for next stages of query execution.

There will still be users, for whom the increase in the memory usage
is unexpected. Those will need to be educated or for them we might
take heuristic PartitionJoinPath based approach discussed above. But I
don't think that heuristic approach should be the default case. May be
we should supply a GUC which can switch between the approaches.

Some ideas for GUCs are 1. delay_partition_wise_join - when ON uses
the heuristic approach of PartitionJoinPath.
2. A GUC similar to join_collapse_limit may be used to limit the
number of partitioned relations being joined using partition-wise join
technique. A value of 1, indicates enable_partition_wise_join = false.
So, we may replace enable_partition_wise_join withe this GUC.
3. A GUC max_joinable_partitions (open to suggestions for name) may
specify the maximum number of partitions that two relations may have
to be eligible for partition-wise join.

I guess, using these GUCs allows a user handle the trade-off between
getting the best plan and memory usage consciously. I think, users
would like to accept a suboptimal plans consciously than being thrown
a suboptimal plan without choice.

[1]: http://postgresql.nabble.com/design-for-a-partitioning-feature-was-inheritance-td5921603.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

free_unused_paths.patchtext/x-patch; charset=US-ASCII; name=free_unused_paths.patch
#20Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#19)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I think there are going to be two kinds of partitioning use-cases.
First, carefully hand-crafted by DBAs so that every partition is
different from other and so is every join between two partitions.
There will be lesser number of partitions, but creating paths for each
join between partitions will be crucial from performance point of
view. Consider, for example, systems which use partitions to
consolidate results from different sources for analytical purposes or
sharding. If we consider various points you have listed in [1] as to
why a partition is equivalent to a table, each join between partitions
is going to have very different characteristics and thus deserves a
set of paths for its own. Add to that possibility of partition pruning
or certain conditions affecting particular partitions, the need for
detailed planning evident.

The other usage of partitioning is to distribute the data and/or
quickly eliminate the data by partition pruning. In such case, all
partitions of a given table will have very similar properties. There
is a large chance that we will end up having same plans for every
partition and for joins between partitions. In such cases, I think it
suffices to create paths for just one or may be a handful partitions
of join and repeat that plan for other partitions of join. But in such
cases it also makes sense to have a light-weight representation for
partitions as compared to partitions being a full-fledged tables. If
we have such a light-weight representation, we may not even create
RelOptInfos representing joins between partitions, and different paths
for each join between partitions.

I'm not sure I see a real distinction between these two use cases. I
think that the problem of differing data distribution between
partitions is almost always going to be an issue. Take the simple
case of an "orders" table which is partitioned by month. First, the
month that's currently in progress may be much smaller than a typical
completed month. Second, many businesses are seasonal and may have
many more orders at certain times of year. For example, in American
retail, many businesses have large spikes in December. I think some
businesses may do four times as much business in December as any other
month, for example. So you will have that sort of variation, at
least.

A typical join tree will be composite: some portion partitioned and
some portion unpartitioned or different portions partitioned by
different partition schemes. In such case, inaccurate costs for
PartitionJoinPath, can affect the plan heavily, causing a suboptimal
path to be picked. Assuming that partitioning will be useful for large
sets of data, choosing a suboptimal plan can be more dangerous than
consuming memory for creating paths.

Well, sure. But, I mean, every simplifying assumption which the
planner makes to limit resource consumption could have that effect.
join_collapse_limit, for example, can cause horrible plans. However,
we have it anyway, because the alternative of having planning take far
too long is unpalatable. Planning is always, at some level,
guesswork.

For each
partition, we switch to a new memory context, perform planning, copy
the best path and its substructure back to the parent context, and
then reset the context.

This could be rather tricky. It assumes that all the code that creates
paths for joins, should not allocate any memory which is linked to
some object in a context that lives longer than the path creation
context. There is some code like create_join_clause() or
make_canonical_pathkey(), which carefully chooses which memory context
to allocate memory in. But can we ensure it always? postgres_fdw for
example allocates memory for PgFdwRelationInfo in current memory
context and attaches it in RelOptInfo, which should be in the
planner's original context. So, if we create a new memory context for
each partition, fpinfos would be invalidated when those contexts are
released. Not that, we can not enforce some restriction on the memory
usage while planning, it's hard to enforce it and bugs arising from it
may go unnoticed. GEQO planner might have its own problems with this
approach. Third party FDWs will pose a problem.

Yep, there are problems. :-)

A possible solution would be to keep the track of used paths using a
reference count. Once the paths for given join tree are created, free
up the unused paths by traversing pathlist in each of the RelOptInfos.
Attached patch has a prototype implementation for the same. There are
some paths which are not linked to RelOptInfos, which need a bit
different treatment, but they can be handled too.

So, if you apply this with your previous patch, how much does it cut
down memory consumption?

In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path
for a single partition.

Maybe there are other ideas but I have a feeling any way you slice it
this is going to be a lot of work.

For the case of carefully hand-crafted partitions, I think, users
would expect the planner to use really the best plan and thus may be
willing to accommodate for increased memory usage. Going by any
approach that does not create the paths for joins between partitions
is not guaranteed to give the best plan. Users willing to provide
increased memory will be unhappy if we do not give them the best path.

The user who creates hundreds of partitions, will ideally be using
pretty powerful servers with a lot of memory. On such servers, the
linear increase in memory for paths may not be as bad as you are
portraying above, as long as its producing the best plan.

No, I don't agree. We should be trying to build something that scales
well. I've heard reports of customers with hundreds or even thousands
of partitions; I think it is quite reasonable to think that we need to
scale to 1000 partitions. If we use 3MB of memory to plan a query
involving unpartitioned, using 3GB to plan a query where the main
tables have been partitioned 1000 ways does not seem reasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#20)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Oct 31, 2016 at 6:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I think there are going to be two kinds of partitioning use-cases.
First, carefully hand-crafted by DBAs so that every partition is
different from other and so is every join between two partitions.
There will be lesser number of partitions, but creating paths for each
join between partitions will be crucial from performance point of
view. Consider, for example, systems which use partitions to
consolidate results from different sources for analytical purposes or
sharding. If we consider various points you have listed in [1] as to
why a partition is equivalent to a table, each join between partitions
is going to have very different characteristics and thus deserves a
set of paths for its own. Add to that possibility of partition pruning
or certain conditions affecting particular partitions, the need for
detailed planning evident.

The other usage of partitioning is to distribute the data and/or
quickly eliminate the data by partition pruning. In such case, all
partitions of a given table will have very similar properties. There
is a large chance that we will end up having same plans for every
partition and for joins between partitions. In such cases, I think it
suffices to create paths for just one or may be a handful partitions
of join and repeat that plan for other partitions of join. But in such
cases it also makes sense to have a light-weight representation for
partitions as compared to partitions being a full-fledged tables. If
we have such a light-weight representation, we may not even create
RelOptInfos representing joins between partitions, and different paths
for each join between partitions.

I'm not sure I see a real distinction between these two use cases. I
think that the problem of differing data distribution between
partitions is almost always going to be an issue. Take the simple
case of an "orders" table which is partitioned by month. First, the
month that's currently in progress may be much smaller than a typical
completed month. Second, many businesses are seasonal and may have
many more orders at certain times of year. For example, in American
retail, many businesses have large spikes in December. I think some
businesses may do four times as much business in December as any other
month, for example. So you will have that sort of variation, at
least.

A typical join tree will be composite: some portion partitioned and
some portion unpartitioned or different portions partitioned by
different partition schemes. In such case, inaccurate costs for
PartitionJoinPath, can affect the plan heavily, causing a suboptimal
path to be picked. Assuming that partitioning will be useful for large
sets of data, choosing a suboptimal plan can be more dangerous than
consuming memory for creating paths.

Well, sure. But, I mean, every simplifying assumption which the
planner makes to limit resource consumption could have that effect.
join_collapse_limit, for example, can cause horrible plans. However,
we have it anyway, because the alternative of having planning take far
too long is unpalatable. Planning is always, at some level,
guesswork.

My point is, this behaviour is configurable. Users who are ready to
spend time and resources to get the best plan are still able to do so,
by choosing a higher limit on join_collapse_limit. Those who can not
afford to do so, are ready to use inferior plans willingly by setting
join_collapse_limit to a lower number.

A possible solution would be to keep the track of used paths using a
reference count. Once the paths for given join tree are created, free
up the unused paths by traversing pathlist in each of the RelOptInfos.
Attached patch has a prototype implementation for the same. There are
some paths which are not linked to RelOptInfos, which need a bit
different treatment, but they can be handled too.

So, if you apply this with your previous patch, how much does it cut
down memory consumption?

Answered this below:

In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path
for a single partition.

Maybe there are other ideas but I have a feeling any way you slice it
this is going to be a lot of work.

For the case of carefully hand-crafted partitions, I think, users
would expect the planner to use really the best plan and thus may be
willing to accommodate for increased memory usage. Going by any
approach that does not create the paths for joins between partitions
is not guaranteed to give the best plan. Users willing to provide
increased memory will be unhappy if we do not give them the best path.

The user who creates hundreds of partitions, will ideally be using
pretty powerful servers with a lot of memory. On such servers, the
linear increase in memory for paths may not be as bad as you are
portraying above, as long as its producing the best plan.

No, I don't agree. We should be trying to build something that scales
well. I've heard reports of customers with hundreds or even thousands
of partitions; I think it is quite reasonable to think that we need to
scale to 1000 partitions. If we use 3MB of memory to plan a query
involving unpartitioned, using 3GB to plan a query where the main
tables have been partitioned 1000 ways does not seem reasonable to me.

Here are memory consumption numbers.

For a simple query "select * from v5_prt100", where v5_prt100 is a
view on a 5 way self join of table prt100, which is a plain table with
100 partitions without any indexes.
postgres=# \d+ v5_prt100
View "part_mem_usage.v5_prt100"
Column | Type | Modifiers | Storage | Description
--------+--------+-----------+----------+-------------
t1 | prt100 | | extended |
t2 | prt100 | | extended |
t3 | prt100 | | extended |
t4 | prt100 | | extended |
t5 | prt100 | | extended |
View definition:
SELECT t1.*::prt100 AS t1,
t2.*::prt100 AS t2,
t3.*::prt100 AS t3,
t4.*::prt100 AS t4,
t5.*::prt100 AS t5
FROM prt100 t1,
prt100 t2,
prt100 t3,
prt100 t4,
prt100 t5
WHERE t1.a = t2.a AND t2.a = t3.a AND t3.a = t4.a AND t4.a = t5.a;

postgres=# \d prt100
Table "part_mem_usage.prt100"
Column | Type | Modifiers
--------+-------------------+-----------
a | integer |
b | integer |
c | character varying |
Partition Key: RANGE (a)
Number of partitions: 100 (Use \d+ to list them.)

Without partition-wise join the standard_planner() consumes 4311 kB
memory of which 150 kB is consumed in add_paths_to_joinrel().

With partition-wise join standard_planner() consumes 65MB memory,
which is 16 times more (not 100 times more as you suspected above). Of
this bloat 16MB is consumed for creating child join paths whereas
651kB is consumed in creating append paths. That's 100 times bloat for
path creation. Rest of the memory bloat is broken down as 9MB to
create child join RelOptInfos, 29MB to translate restrict clauses, 8MB
to translate target lists. 2MB for creating special join info for
children, 2MB goes into creating plans.

If we apply logic to free unused paths, the memory consumption reduces
as follows

Without partition-wise join standard_planner() consumes 4268 kB
(against 4311kB earlier) of which 123kB (against 150kB earlier) is
consumed in add_paths_to_joinrel().

With partition-wise join, standard_planner() consumes 63MB (against
65MB earlier). Child join paths still consume 13 MB (against 16MB
earlier), which is still 100 times that without using partition-wise
join. We may shave off some memory consumption by using better methods
than translating expressions, but we will continue to have bloats
introduced by paths, RelOptInfos for child joins etc.

So, I am thinking about your approach of creating PartitionJoinPaths
without actually creating child paths and then at a later stage
actually plan the child joins. Here's rough sketch of how that may be
done.

At the time of creating regular paths, we identify the join orders
which can use partition-wise join and save those in the RelOptInfo of
the parent table. If no such join order exists, we do not create
PartitionJoinPaths for that relation. Otherwise, once we have
considered all the join orders i.e. in
generate_partition_wise_join_paths(), we create one PartitionJoinPath
for every path that has survived in the parent or at least for every
path that has distinct properties like pathkeys or parameterisation,
with those properties.

At the time of creating plans, if PartitionJoinPath is chosen, we
actually create paths for every partition of that relation
recursively. The path creation logic is carried out in a different
memory context. Amongst the paths that survive, we choose the best
path that has the same properties as PartitionJoinPath. We would
expect all parameterized paths to be retained and any unparameterized
path can be sorted to match the pathkeys of reference
PartitionJoinPath. We then create the plan out of this path and copy
it into the outer memory context and release the memory context used
for path creation. This is similar to how prepared statements save
their plans. Once we have the plan, the memory consumed by paths won't
be referenced, and hence can not create problems. At the end we create
an Append/MergeAppend plan with all the child plans and return it.

Costing PartitionJoinPath needs more thought so that we don't end up
with bad overall plans. Here's an idea. Partition-wise joins are
better compared to the unpartitioned ones, because of the smaller
sizes of partitions. If we think of join as O(MN) operation where M
and N are sizes of unpartitioned tables being joined, partition-wise
join computes P joins each with average O(M/P * N/P) order where P is
the number of partitions, which is still O(MN) with constant factor
reduced by P times. I think, we need to apply similar logic to
costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
where S and R are setup cost and joining cost (for M, N rows) resp.
Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
+ P * R(M/P, N/P). Each of the join methods will have different S and
R functions and may not be linear on the number of rows. So,
PartitionJoinPath costs are obtained from corresponding regular path
costs subjected to above transformation. This way, we will be
protected from choosing a PartitionJoinPath when it's not optimal.
Take example of a join where the joining relations are very small in
size, thus hash join on full relation is optimal compared to hash join
of each partition because of setup cost. In such a case, the function
which calculates the cost of hash table setup, would result in almost
same cost for full table as well as each of the partitions, thus
increasing P * S(M/P, N/P) as compared to S(M, N).

Let me know your comments.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#21)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

So, I am thinking about your approach of creating PartitionJoinPaths
without actually creating child paths and then at a later stage
actually plan the child joins. Here's rough sketch of how that may be
done.

At the time of creating regular paths, we identify the join orders
which can use partition-wise join and save those in the RelOptInfo of
the parent table. If no such join order exists, we do not create
PartitionJoinPaths for that relation. Otherwise, once we have
considered all the join orders i.e. in
generate_partition_wise_join_paths(), we create one PartitionJoinPath
for every path that has survived in the parent or at least for every
path that has distinct properties like pathkeys or parameterisation,
with those properties.

At the time of creating plans, if PartitionJoinPath is chosen, we
actually create paths for every partition of that relation
recursively. The path creation logic is carried out in a different
memory context. Amongst the paths that survive, we choose the best
path that has the same properties as PartitionJoinPath. We would
expect all parameterized paths to be retained and any unparameterized
path can be sorted to match the pathkeys of reference
PartitionJoinPath. We then create the plan out of this path and copy
it into the outer memory context and release the memory context used
for path creation. This is similar to how prepared statements save
their plans. Once we have the plan, the memory consumed by paths won't
be referenced, and hence can not create problems. At the end we create
an Append/MergeAppend plan with all the child plans and return it.

Costing PartitionJoinPath needs more thought so that we don't end up
with bad overall plans. Here's an idea. Partition-wise joins are
better compared to the unpartitioned ones, because of the smaller
sizes of partitions. If we think of join as O(MN) operation where M
and N are sizes of unpartitioned tables being joined, partition-wise
join computes P joins each with average O(M/P * N/P) order where P is
the number of partitions, which is still O(MN) with constant factor
reduced by P times. I think, we need to apply similar logic to
costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
where S and R are setup cost and joining cost (for M, N rows) resp.
Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
+ P * R(M/P, N/P). Each of the join methods will have different S and
R functions and may not be linear on the number of rows. So,
PartitionJoinPath costs are obtained from corresponding regular path
costs subjected to above transformation. This way, we will be
protected from choosing a PartitionJoinPath when it's not optimal.
Take example of a join where the joining relations are very small in
size, thus hash join on full relation is optimal compared to hash join
of each partition because of setup cost. In such a case, the function
which calculates the cost of hash table setup, would result in almost
same cost for full table as well as each of the partitions, thus
increasing P * S(M/P, N/P) as compared to S(M, N).

Let me know your comments.

I tried to measure the impact of having a memory context reset 1000
times (once for each partition) with the attached patch. Without this
patch make check in regress/ takes about 24 seconds on my laptop and
with this patch it takes 26 seconds. This is almost 10% increase in
time. I hope that's fine.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

memory_context_change.patchtext/x-patch; charset=US-ASCII; name=memory_context_change.patch
#23Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#21)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Costing PartitionJoinPath needs more thought so that we don't end up
with bad overall plans. Here's an idea. Partition-wise joins are
better compared to the unpartitioned ones, because of the smaller
sizes of partitions. If we think of join as O(MN) operation where M
and N are sizes of unpartitioned tables being joined, partition-wise
join computes P joins each with average O(M/P * N/P) order where P is
the number of partitions, which is still O(MN) with constant factor
reduced by P times. I think, we need to apply similar logic to
costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
where S and R are setup cost and joining cost (for M, N rows) resp.
Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
+ P * R(M/P, N/P). Each of the join methods will have different S and
R functions and may not be linear on the number of rows. So,
PartitionJoinPath costs are obtained from corresponding regular path
costs subjected to above transformation. This way, we will be
protected from choosing a PartitionJoinPath when it's not optimal.

I'm not sure that I really understand the stuff with big-O notation
and M, N, and P. But I think what you are saying is that we could
cost a PartitionJoinPath by costing some of the partitions (it might
be a good idea to choose the biggest ones) and assuming the cost for
the remaining ones will be roughly proportional. That does seem like
a reasonable strategy to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#23)
Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Costing PartitionJoinPath needs more thought so that we don't end up
with bad overall plans. Here's an idea. Partition-wise joins are
better compared to the unpartitioned ones, because of the smaller
sizes of partitions. If we think of join as O(MN) operation where M
and N are sizes of unpartitioned tables being joined, partition-wise
join computes P joins each with average O(M/P * N/P) order where P is
the number of partitions, which is still O(MN) with constant factor
reduced by P times. I think, we need to apply similar logic to
costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
where S and R are setup cost and joining cost (for M, N rows) resp.
Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
+ P * R(M/P, N/P). Each of the join methods will have different S and
R functions and may not be linear on the number of rows. So,
PartitionJoinPath costs are obtained from corresponding regular path
costs subjected to above transformation. This way, we will be
protected from choosing a PartitionJoinPath when it's not optimal.

I'm not sure that I really understand the stuff with big-O notation
and M, N, and P. But I think what you are saying is that we could
cost a PartitionJoinPath by costing some of the partitions (it might
be a good idea to choose the biggest ones) and assuming the cost for
the remaining ones will be roughly proportional. That does seem like
a reasonable strategy to me.

I'm not sure to what extent the above argument depends on the assumption
that join is O(MN), but I will point out that in no case of practical
interest for large tables is it actually O(MN). That would be true
only for the stupidest possible nested-loop join method. It would be
wise to convince ourselves that the argument holds for more realistic
big-O costs, eg hash join is more like O(M+N) if all goes well.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#24)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Nov 14, 2016 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Nov 4, 2016 at 6:52 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Costing PartitionJoinPath needs more thought so that we don't end up
with bad overall plans. Here's an idea. Partition-wise joins are
better compared to the unpartitioned ones, because of the smaller
sizes of partitions. If we think of join as O(MN) operation where M
and N are sizes of unpartitioned tables being joined, partition-wise
join computes P joins each with average O(M/P * N/P) order where P is
the number of partitions, which is still O(MN) with constant factor
reduced by P times. I think, we need to apply similar logic to
costing. Let's say cost of a join is J(M, N) = S (M, N) + R (M, N)
where S and R are setup cost and joining cost (for M, N rows) resp.
Cost of partition-wise join would be P * J(M/P, N/P) = P * S(M/P, N/P)
+ P * R(M/P, N/P). Each of the join methods will have different S and
R functions and may not be linear on the number of rows. So,
PartitionJoinPath costs are obtained from corresponding regular path
costs subjected to above transformation. This way, we will be
protected from choosing a PartitionJoinPath when it's not optimal.

I'm not sure that I really understand the stuff with big-O notation
and M, N, and P. But I think what you are saying is that we could
cost a PartitionJoinPath by costing some of the partitions (it might
be a good idea to choose the biggest ones) and assuming the cost for
the remaining ones will be roughly proportional. That does seem like
a reasonable strategy to me.

I'm not sure to what extent the above argument depends on the assumption
that join is O(MN), but I will point out that in no case of practical
interest for large tables is it actually O(MN). That would be true
only for the stupidest possible nested-loop join method. It would be
wise to convince ourselves that the argument holds for more realistic
big-O costs, eg hash join is more like O(M+N) if all goes well.

Yeah, I agree. To recap briefly, the problem we're trying to solve
here is how to build a path for a partitionwise join without an
explosion in the amount of memory the planner uses or the number of
paths created. In the initial design, if there are N partitions per
relation, the total number of paths generated by the planner increases
by a factor of N+1, which gets ugly if, say, N = 1000, or even N =
100. To reign that in, we want to do a rough cut at costing the
partitionwise join that will be good enough to let us throw away
obviously inferior paths, and then work out the exact paths we're
going to use only for partitionwise joins that are actually selected.
I think costing one or a few of the larger sub-joins and assuming
those costs are representative is probably a reasonable approach to
that problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#23)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hi Robert,
Sorry for delayed response.

The attached patch implements following ideas:
1. At the time of creating paths - If the joining relations are both
partitioned and join can use partition-wise join, we create paths for
few child-joins. Similar to inheritance relations
(set_append_rel_pathlist()), we collect paths with similar properties
from all sampled child-joins and create one PartitionJoinPath with
each set of paths. The cost of the PartitionJoinPath is obtained by
multiplying the sum of costs of paths in the given set by the ratio of
(number of rows estimated in the parent-join/sum of rows in
child-joins).

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Right now, we choose 1% or 1 (whichever is higher) child-joins to base
PartitionJoinPath costs on.

Memory consumption
-----------------------------
I tested a 5-way self-join for a table with 1000 partitions, each
partition having 1M rows. The memory consumed in standard_planner()
was measured with some granular tracking
(mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
consumed total of 289MB memory which is approx 6.6 times more than
non-partition-wise join which consumed 44MB. That's much better than
the earlier 16 times consumption for 5-way join with 100 partitions.

The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
SpecialJoinInfos for child-joins (64MB), restrictlist translation
(92MB), paths for sampled child-joins (1.5MB), building targetlists
for child-joins (7MB).

In order to choose representative child-joins based on the sizes of
child-joins, we need to create all the child-join RelOptInfos. In
order to estimate sizes of child-joins, we need to create
SpecialJoinInfos and restrictlists for at least one join order for all
child-joins. For every representative child-join, we need to create
SpecialJoinInfo and restrictlist for all join orders for that
child-join. We might be able to save of restrictlist translation, if
we create restrict lists from joininfo similar to parent joins. I
haven't tried that yet.

Choosing representative child-joins:
--------------------------------------------------
There's another angle to choosing representative child joins. In a
partitioned N-way join, different joins covering different subsets of
N relations, will have different size distributions across the
partitions. This means that the child-joins costed for (N-k) joins,
may be different for those required for (N-k+1) joins. With a factor
of 1% sampling, N is such that a child-join participates in 100 joins,
we will end up creating paths for all partitions before creating
PartitionJoinPaths for the final N-way join. Hopefully that will be a
rare case and usually we will end up using paths already created. We
can not avoid creating PartitionJoinPaths for subset joins, as there
might be cases when partition-wise join will be optimal for an N-k way
join but not for N-way join. We may avoid this if we choose
representative child-joins based on their positions, in which case, we
may end up with some or all of those being empty and thus skewing the
costs heavily.

Partial paths
-----------------
AFAIU, we create partial paths for append relation, when all the
children have partial paths. Unlike parameterized paths or path with
pathkeys, there is no way to create a partial path for a normal path.
This means that unless we create paths for all child-joins, we can not
create partial paths for appendrel comprising of child-joins, and thus
can not use parallel query right now. This may not be that bad, since
it would be more efficient to run each child-join in a separate
worker, rather than using multiple workers for a single child-join.

regression tests
----------------------
I observed that for small relations (1000 rows in each partition and
100 partitions), the size estimates in append relations and sum of
those in child relations are very different. As a result, the
extrapolated costs for PartitionJoinPaths as described above, are way
higher than costs of join of appends (or even append of joins if we
are to create paths for all child-joins). Thus with this approach, we
choose partition-wise join for large number of partitions with large
data (e.g. 1000 partitions with 1M rows each). These are certainly the
cases when partition-wise join is a big win. I have not tried to find
out a threshold above which partition-wise join gets chosen with above
approach, but it's going to be a larger threshold. That makes writing
regression tests difficult, as those will require large data. So, we
have to find a way so that we can test partition-wise join with
smaller data. There are few possibilities like 1. convert the fraction
of representative child-joins into GUC and setting it to 100% would
start choosing partition-wise joins for tables with a few hundred rows
per partition, like it did in earlier approach, 2. provide a way to
force partition-wise join whenever possible, by say costing
partition-wise joins much lesser than non-partition-wise join when a
GUC is set (e.g. enable_partition_wise_join with values always, never,
optimal or something like that).

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

mem_usage_func_wise_measurement_slabwise.patchbinary/octet-stream; name=mem_usage_func_wise_measurement_slabwise.patch
pg_dp_join_v5.patchbinary/octet-stream; name=pg_dp_join_v5.patch
#27Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#26)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

PFA patch rebased after partitioning code was committed.

On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Hi Robert,
Sorry for delayed response.

The attached patch implements following ideas:
1. At the time of creating paths - If the joining relations are both
partitioned and join can use partition-wise join, we create paths for
few child-joins. Similar to inheritance relations
(set_append_rel_pathlist()), we collect paths with similar properties
from all sampled child-joins and create one PartitionJoinPath with
each set of paths. The cost of the PartitionJoinPath is obtained by
multiplying the sum of costs of paths in the given set by the ratio of
(number of rows estimated in the parent-join/sum of rows in
child-joins).

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Right now, we choose 1% or 1 (whichever is higher) child-joins to base
PartitionJoinPath costs on.

Memory consumption
-----------------------------
I tested a 5-way self-join for a table with 1000 partitions, each
partition having 1M rows. The memory consumed in standard_planner()
was measured with some granular tracking
(mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
consumed total of 289MB memory which is approx 6.6 times more than
non-partition-wise join which consumed 44MB. That's much better than
the earlier 16 times consumption for 5-way join with 100 partitions.

The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
SpecialJoinInfos for child-joins (64MB), restrictlist translation
(92MB), paths for sampled child-joins (1.5MB), building targetlists
for child-joins (7MB).

In the earlier implementation, a given clause which was applicable to
multiple join orders was getting translated as many times as the join
orders it was applicable in. I changed RestrictInfo for parent to
store a list of RestrictInfos applicable to children to avoid multiple
translations.

My earlier patch created the child-join plans in a temporary context
and then copied them into planner context since the translated clauses
were allocated memory in temporary memory context then. Now that they
are stored in planner's context, we can directly create the plan in
the planner's context.

Third, I added code to free up child SpecialJoinInfos after using those.

As a result the total memory consumption now is 192MB, which is approx
4.4 times the memory consumed during planning in case of
non-partition-wise join.

Choosing representative child-joins:
--------------------------------------------------
There's another angle to choosing representative child joins. In a
partitioned N-way join, different joins covering different subsets of
N relations, will have different size distributions across the
partitions. This means that the child-joins costed for (N-k) joins,
may be different for those required for (N-k+1) joins. With a factor
of 1% sampling, N is such that a child-join participates in 100 joins,
we will end up creating paths for all partitions before creating
PartitionJoinPaths for the final N-way join. Hopefully that will be a
rare case and usually we will end up using paths already created. We
can not avoid creating PartitionJoinPaths for subset joins, as there
might be cases when partition-wise join will be optimal for an N-k way
join but not for N-way join. We may avoid this if we choose
representative child-joins based on their positions, in which case, we
may end up with some or all of those being empty and thus skewing the
costs heavily.

Partial paths
-----------------
AFAIU, we create partial paths for append relation, when all the
children have partial paths. Unlike parameterized paths or path with
pathkeys, there is no way to create a partial path for a normal path.
This means that unless we create paths for all child-joins, we can not
create partial paths for appendrel comprising of child-joins, and thus
can not use parallel query right now. This may not be that bad, since
it would be more efficient to run each child-join in a separate
worker, rather than using multiple workers for a single child-join.

This still applies.

regression tests
----------------------
I observed that for small relations (1000 rows in each partition and
100 partitions), the size estimates in append relations and sum of
those in child relations are very different. As a result, the
extrapolated costs for PartitionJoinPaths as described above, are way
higher than costs of join of appends (or even append of joins if we
are to create paths for all child-joins). Thus with this approach, we
choose partition-wise join for large number of partitions with large
data (e.g. 1000 partitions with 1M rows each). These are certainly the
cases when partition-wise join is a big win. I have not tried to find
out a threshold above which partition-wise join gets chosen with above
approach, but it's going to be a larger threshold. That makes writing
regression tests difficult, as those will require large data. So, we
have to find a way so that we can test partition-wise join with
smaller data. There are few possibilities like 1. convert the fraction
of representative child-joins into GUC and setting it to 100% would
start choosing partition-wise joins for tables with a few hundred rows
per partition, like it did in earlier approach, 2. provide a way to
force partition-wise join whenever possible, by say costing
partition-wise joins much lesser than non-partition-wise join when a
GUC is set (e.g. enable_partition_wise_join with values always, never,
optimal or something like that).

For now I have added a float GUC partition_wise_plan_weight. The
partition-wise join cost derived from the samples is multiplied by
this GUC and set as the cost of ParitionJoinPath. A value of 1 means
that the cost derived from the samples are used as is. A value higher
than 1 discourages use of partition-wise join and that lower than 1
encourages use of partition-wise join. I am not very keen on keeping
this GUC, in this form. But we need some way to run regression with
smaller data.

For now I have disabled partition-wise join for multi-level
partitions. I will post a patch soon with that enabled.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_v5.patchbinary/octet-stream; name=pg_dp_join_v5.patch
#28Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#27)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Dec 27, 2016 at 11:01 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

PFA patch rebased after partitioning code was committed.

On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Hi Robert,
Sorry for delayed response.

The attached patch implements following ideas:
1. At the time of creating paths - If the joining relations are both
partitioned and join can use partition-wise join, we create paths for
few child-joins. Similar to inheritance relations
(set_append_rel_pathlist()), we collect paths with similar properties
from all sampled child-joins and create one PartitionJoinPath with
each set of paths. The cost of the PartitionJoinPath is obtained by
multiplying the sum of costs of paths in the given set by the ratio of
(number of rows estimated in the parent-join/sum of rows in
child-joins).

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Right now, we choose 1% or 1 (whichever is higher) child-joins to base
PartitionJoinPath costs on.

Memory consumption
-----------------------------
I tested a 5-way self-join for a table with 1000 partitions, each
partition having 1M rows. The memory consumed in standard_planner()
was measured with some granular tracking
(mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
consumed total of 289MB memory which is approx 6.6 times more than
non-partition-wise join which consumed 44MB. That's much better than
the earlier 16 times consumption for 5-way join with 100 partitions.

The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
SpecialJoinInfos for child-joins (64MB), restrictlist translation
(92MB), paths for sampled child-joins (1.5MB), building targetlists
for child-joins (7MB).

In the earlier implementation, a given clause which was applicable to
multiple join orders was getting translated as many times as the join
orders it was applicable in. I changed RestrictInfo for parent to
store a list of RestrictInfos applicable to children to avoid multiple
translations.

My earlier patch created the child-join plans in a temporary context
and then copied them into planner context since the translated clauses
were allocated memory in temporary memory context then. Now that they
are stored in planner's context, we can directly create the plan in
the planner's context.

Third, I added code to free up child SpecialJoinInfos after using those.

As a result the total memory consumption now is 192MB, which is approx
4.4 times the memory consumed during planning in case of
non-partition-wise join.

Choosing representative child-joins:
--------------------------------------------------
There's another angle to choosing representative child joins. In a
partitioned N-way join, different joins covering different subsets of
N relations, will have different size distributions across the
partitions. This means that the child-joins costed for (N-k) joins,
may be different for those required for (N-k+1) joins. With a factor
of 1% sampling, N is such that a child-join participates in 100 joins,
we will end up creating paths for all partitions before creating
PartitionJoinPaths for the final N-way join. Hopefully that will be a
rare case and usually we will end up using paths already created. We
can not avoid creating PartitionJoinPaths for subset joins, as there
might be cases when partition-wise join will be optimal for an N-k way
join but not for N-way join. We may avoid this if we choose
representative child-joins based on their positions, in which case, we
may end up with some or all of those being empty and thus skewing the
costs heavily.

Partial paths
-----------------
AFAIU, we create partial paths for append relation, when all the
children have partial paths. Unlike parameterized paths or path with
pathkeys, there is no way to create a partial path for a normal path.
This means that unless we create paths for all child-joins, we can not
create partial paths for appendrel comprising of child-joins, and thus
can not use parallel query right now. This may not be that bad, since
it would be more efficient to run each child-join in a separate
worker, rather than using multiple workers for a single child-join.

This still applies.

regression tests
----------------------
I observed that for small relations (1000 rows in each partition and
100 partitions), the size estimates in append relations and sum of
those in child relations are very different. As a result, the
extrapolated costs for PartitionJoinPaths as described above, are way
higher than costs of join of appends (or even append of joins if we
are to create paths for all child-joins). Thus with this approach, we
choose partition-wise join for large number of partitions with large
data (e.g. 1000 partitions with 1M rows each). These are certainly the
cases when partition-wise join is a big win. I have not tried to find
out a threshold above which partition-wise join gets chosen with above
approach, but it's going to be a larger threshold. That makes writing
regression tests difficult, as those will require large data. So, we
have to find a way so that we can test partition-wise join with
smaller data. There are few possibilities like 1. convert the fraction
of representative child-joins into GUC and setting it to 100% would
start choosing partition-wise joins for tables with a few hundred rows
per partition, like it did in earlier approach, 2. provide a way to
force partition-wise join whenever possible, by say costing
partition-wise joins much lesser than non-partition-wise join when a
GUC is set (e.g. enable_partition_wise_join with values always, never,
optimal or something like that).

For now I have added a float GUC partition_wise_plan_weight. The
partition-wise join cost derived from the samples is multiplied by
this GUC and set as the cost of ParitionJoinPath. A value of 1 means
that the cost derived from the samples are used as is. A value higher
than 1 discourages use of partition-wise join and that lower than 1
encourages use of partition-wise join. I am not very keen on keeping
this GUC, in this form. But we need some way to run regression with
smaller data.

For now I have disabled partition-wise join for multi-level
partitions. I will post a patch soon with that enabled.

PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
on the latest code.

Also, PFA patch to support partition-wise join between multi-level
partitioned tables. I copied the Amit Langote's patch for translating
partition hierarchy into inheritance hierarchy and added code to
support partition-wise join. You had expressed some concerns about
Amit's approach in [1]/messages/by-id/CA+TgmoaEU10Kmdy44izcqJYLh1fkh58_6sbGGu0Q4b7PPE46eA@mail.gmail.com, but that discussion is still open. So, I
haven't merged those changes to partition-wise join patch. We may
continue to work on it as separate patch or I can include it in
partition-wise join main patch.

BTW, INSERT into multi-level partitioned tables is crashing with
latest head. The issue was reported in [2]/messages/by-id/CAKcux6=m1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw@mail.gmail.com. Because of that
multi_level_partition_join test crashes in pg_dp_join_v6.patch.
Intestingly the crash vanishes when we apply patch supporting
mult-level partition-wise join.

[1]: /messages/by-id/CA+TgmoaEU10Kmdy44izcqJYLh1fkh58_6sbGGu0Q4b7PPE46eA@mail.gmail.com
[2]: /messages/by-id/CAKcux6=m1qyqB2k6cjniuMMrYXb75O-MB4qGQMu8zg-iGGLjDw@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_v6.patchapplication/x-download; name=pg_dp_join_v6.patch
multi_level_partition_join.patchapplication/x-download; name=multi_level_partition_join.patch
#29Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#28)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
on the latest code.

Maybe not surprisingly given how fast things are moving around here
these days, this needs a rebase.

Apart from that, my overall comment on this patch is that it's huge:

37 files changed, 7993 insertions(+), 287 deletions(-)

Now, more than half of that is regression test cases and their output,
which you will certainly be asked to pare down in any version of this
intended for commit. But even excluding those, it's still a fairly
patch:

30 files changed, 2783 insertions(+), 272 deletions(-)

I think the reason this is so large is because there's a fair amount
of refactoring work that has been done as a precondition of the actual
meat of the patch, and no attempt has been made to separate the
refactoring work from the main body of the patch. I think that's
something that needs to be done. If you look at the way Amit Langote
submitted the partitioning patches and the follow-up bug fixes, he had
a series of patches 0001-blah, 0002-quux, etc. generated using
format-patch. Each patch had its own commit message written by him
explaining the purpose of that patch, links to relevant discussion,
etc. If you can separate this into more digestible chunks it will be
easier to get committed.

Other questions/comments:

Why does find_partition_scheme need to copy the partition bound
information instead of just pointing to it? Amit went to some trouble
to make sure that this can't change under us while we hold a lock on
the relation, and we'd better hold a lock on the relation if we're
planning a query against it.

I think the PartitionScheme stuff should live in the optimizer rather
that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
eventually need a new file in the optimizer just for partitioning
stuff, but I'm not sure about that yet.

The fact that set_append_rel_size needs to reopen the relation to
extract a few more bits of information is not desirable. You need to
fish this information through in some other way; for example, you
could have get_relation_info() stash the needed bits in the
RelOptInfo.

+                * For two partitioned tables with the same
partitioning scheme, it is
+                * assumed that the Oids of matching partitions from
both the tables
+                * are placed at the same position in the array of
partition oids in

Rather than saying that we assume this, you should say why it has to
be true. (If it doesn't have to be true, we shouldn't assume it.)

+                * join relations. Partition tables should have same
layout as the
+                * parent table and hence should not need any
translation. But rest of

The same attributes have to be present with the same types, but they
can be rearranged. This comment seems to imply the contrary.

FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+               /*
+                * Add this relation to the list of samples ordered by
the increasing
+                * number of rows at appropriate place.
+                */
+               foreach (lc, ordered_child_nos)
+               {
+                       int     child_no = lfirst_int(lc);
+                       RelOptInfo *other_childrel = rel->part_rels[child_no];
+
+                       /*
+                        * Keep track of child with lowest number of
rows but higher than the
+                        * that of the child being inserted. Insert
the child before a
+                        * child with highest number of rows lesser than it.
+                        */
+                       if (child_rel->rows <= other_childrel->rows)
+                               insert_after = lc;
+                       else
+                               break;
+               }

Can we use quicksort instead of a hand-coded insertion sort?

+ if (bms_num_members(outer_relids) > 1)

Seems like bms_get_singleton_member could be used.

+ * Partitioning scheme in join relation indicates a possibilty that the

Spelling.

There seems to be no reason for create_partition_plan to be separated
from create_plan_recurse. You can just add another case for the new
path type.

Why does create_partition_join_path need to be separate from
create_partition_join_path_with_pathkeys? Couldn't that be combined
into a single function with a pathkeys argument that might sometimes
be NIL? I assume most of the logic is common.

From a sort of theoretical standpoint, the biggest danger of this
patch seems to be that by deferring path creation until a later stage
than normal, we could miss some important processing.
subquery_planner() does a lot of stuff after
expand_inherited_tables(); if any of those things, especially the ones
that happen AFTER path generation, have an effect on the paths, then
this code needs to compensate for those changes somehow. It seems
like having the planning of unsampled children get deferred until
create_plan() time is awfully surprising; here we are creating the
plan and suddenly what used to be a straightforward path->plan
translation is running around doing major planning work. I can't
entirely justify it, but I somehow have a feeling that work ought to
be moved earlier. Not sure exactly where.

This is not really a full review, mostly because I can't easily figure
out the motivation for all of the changes the patch makes. It makes a
lot of changes in a lot of places, and it's not really very easy to
understand why those changes are necessary. My comments above about
splitting the patch into a series of patches that can potentially be
reviewed and applied independently, with the main patch being the last
in the series, are a suggestion as to how to tackle that. There might
be some work that needs to or could be done on the comments, too. For
example, the patch splits out add_paths_to_append_rel from
set_append_rel_pathlist, but the comments don't say anything helpful
like "we need to do X after Y, because Z". They just say that we do
it. To some extent I think the comments in the optimizer have that
problem generally, so it's not entirely the fault of this patch;
still, the lack of those explanations makes the code reorganization
harder to follow, and might confuse future patch authors, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#29)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Feb 2, 2017 at 2:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 2, 2017 at 7:32 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
on the latest code.

Maybe not surprisingly given how fast things are moving around here
these days, this needs a rebase.

Apart from that, my overall comment on this patch is that it's huge:

37 files changed, 7993 insertions(+), 287 deletions(-)

Now, more than half of that is regression test cases and their output,
which you will certainly be asked to pare down in any version of this
intended for commit.

Yes. I will work on that once the design and implementation is in
acceptable state. I have already toned down testcases compared to the
previous patch.

But even excluding those, it's still a fairly
patch:

30 files changed, 2783 insertions(+), 272 deletions(-)

I think the reason this is so large is because there's a fair amount
of refactoring work that has been done as a precondition of the actual
meat of the patch, and no attempt has been made to separate the
refactoring work from the main body of the patch. I think that's
something that needs to be done. If you look at the way Amit Langote
submitted the partitioning patches and the follow-up bug fixes, he had
a series of patches 0001-blah, 0002-quux, etc. generated using
format-patch. Each patch had its own commit message written by him
explaining the purpose of that patch, links to relevant discussion,
etc. If you can separate this into more digestible chunks it will be
easier to get committed.

I will try to break down the patch into smaller, easy-to-review,
logically cohesive patches.

Other questions/comments:

Why does find_partition_scheme need to copy the partition bound
information instead of just pointing to it? Amit went to some trouble
to make sure that this can't change under us while we hold a lock on
the relation, and we'd better hold a lock on the relation if we're
planning a query against it.

PartitionScheme is shared across multiple relations, join or base,
partitioned similarly. Obviously it can't and does not need to point
partition bound informations (which should all be same) of all those
base relations. O the the face of it, it looks weird that it points to
only one of them, mostly the one which it encounters first. But, since
it's going to be the same partition bound information, it doesn't
matter which one. So, I think, we can point of any one of those. Do
you agree?

I think the PartitionScheme stuff should live in the optimizer rather
that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
eventually need a new file in the optimizer just for partitioning
stuff, but I'm not sure about that yet.

I placed PartitionScheme stuff in partition.c because most of the
functions and structures in partition.c are not visible outside that
file. But I will try again to locate PartitionScheme to optimizer.

The fact that set_append_rel_size needs to reopen the relation to
extract a few more bits of information is not desirable. You need to
fish this information through in some other way; for example, you
could have get_relation_info() stash the needed bits in the
RelOptInfo.

I considered this option and discarded it, since not all partitioned
relations will have OIDs for partitions e.g. partitioned joins will
not have OIDs for their partitions. But now that I think of it, we
should probably store those OIDs just for the base relation and leave
them unused for non-base relations just like other base relation
specific fields in RelOptInfo.

+                * For two partitioned tables with the same
partitioning scheme, it is
+                * assumed that the Oids of matching partitions from
both the tables
+                * are placed at the same position in the array of
partition oids in

Rather than saying that we assume this, you should say why it has to
be true. (If it doesn't have to be true, we shouldn't assume it.)

Will take care of this.

+                * join relations. Partition tables should have same
layout as the
+                * parent table and hence should not need any
translation. But rest of

The same attributes have to be present with the same types, but they
can be rearranged. This comment seems to imply the contrary.

Hmm, will take care of this.

FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+1. Will take care of this. Does "representative_partitions_fraction"
or "sample_partition_fraction" look like a good GUC name? Any other
suggestions?

+               /*
+                * Add this relation to the list of samples ordered by
the increasing
+                * number of rows at appropriate place.
+                */
+               foreach (lc, ordered_child_nos)
+               {
+                       int     child_no = lfirst_int(lc);
+                       RelOptInfo *other_childrel = rel->part_rels[child_no];
+
+                       /*
+                        * Keep track of child with lowest number of
rows but higher than the
+                        * that of the child being inserted. Insert
the child before a
+                        * child with highest number of rows lesser than it.
+                        */
+                       if (child_rel->rows <= other_childrel->rows)
+                               insert_after = lc;
+                       else
+                               break;
+               }

Can we use quicksort instead of a hand-coded insertion sort?

I guess so, if I write comparison functions, which shouldn't be a
problem. Will try that.

+ if (bms_num_members(outer_relids) > 1)

Seems like bms_get_singleton_member could be used.

+ * Partitioning scheme in join relation indicates a possibilty that the

Spelling.

There seems to be no reason for create_partition_plan to be separated
from create_plan_recurse. You can just add another case for the new
path type.

Why does create_partition_join_path need to be separate from
create_partition_join_path_with_pathkeys? Couldn't that be combined
into a single function with a pathkeys argument that might sometimes
be NIL? I assume most of the logic is common.

From a sort of theoretical standpoint, the biggest danger of this
patch seems to be that by deferring path creation until a later stage
than normal, we could miss some important processing.
subquery_planner() does a lot of stuff after
expand_inherited_tables(); if any of those things, especially the ones
that happen AFTER path generation, have an effect on the paths, then
this code needs to compensate for those changes somehow. It seems
like having the planning of unsampled children get deferred until
create_plan() time is awfully surprising; here we are creating the
plan and suddenly what used to be a straightforward path->plan
translation is running around doing major planning work. I can't
entirely justify it, but I somehow have a feeling that work ought to
be moved earlier. Not sure exactly where.

This is not really a full review, mostly because I can't easily figure
out the motivation for all of the changes the patch makes. It makes a
lot of changes in a lot of places, and it's not really very easy to
understand why those changes are necessary. My comments above about
splitting the patch into a series of patches that can potentially be
reviewed and applied independently, with the main patch being the last
in the series, are a suggestion as to how to tackle that. There might
be some work that needs to or could be done on the comments, too. For
example, the patch splits out add_paths_to_append_rel from
set_append_rel_pathlist, but the comments don't say anything helpful
like "we need to do X after Y, because Z". They just say that we do
it. To some extent I think the comments in the optimizer have that
problem generally, so it's not entirely the fault of this patch;
still, the lack of those explanations makes the code reorganization
harder to follow, and might confuse future patch authors, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#30)
Re: Partition-wise join for join between (declaratively) partitioned tables

sent the previous mail before completing my reply. Sorry for that.
Here's the rest of the reply.

+ if (bms_num_members(outer_relids) > 1)

Seems like bms_get_singleton_member could be used.

+ * Partitioning scheme in join relation indicates a possibilty that the

Spelling.

Will take care of this.

There seems to be no reason for create_partition_plan to be separated
from create_plan_recurse. You can just add another case for the new
path type.

Will take care of this.

Why does create_partition_join_path need to be separate from
create_partition_join_path_with_pathkeys? Couldn't that be combined
into a single function with a pathkeys argument that might sometimes
be NIL? I assume most of the logic is common.

Agreed. will take care of this.

From a sort of theoretical standpoint, the biggest danger of this
patch seems to be that by deferring path creation until a later stage
than normal, we could miss some important processing.
subquery_planner() does a lot of stuff after
expand_inherited_tables(); if any of those things, especially the ones
that happen AFTER path generation, have an effect on the paths, then
this code needs to compensate for those changes somehow. It seems
like having the planning of unsampled children get deferred until
create_plan() time is awfully surprising; here we are creating the
plan and suddenly what used to be a straightforward path->plan
translation is running around doing major planning work. I can't
entirely justify it, but I somehow have a feeling that work ought to
be moved earlier. Not sure exactly where.

I agree with this. Probably we should add a path tree mutator before
SS_identify_outer_params() to replace any Partition*Paths with
Merge/Append paths. The mutator will create paths for child-joins
within temporary memory context, copy the relevant paths and create
Merge/Append paths. There are two problems there 1. We have to write
code to copy paths; most of the paths would be flat copy but custom
scan paths might have some unexpected problems. 2. There will be many
surviving PartitionPaths, and all the corresponding child paths would
need copying and consume memory. In order to reduce that consumption,
we have run this mutator after set_cheapest() in subquery_planner();
but then nothing interesting happens between that and create_plan().
Expanding PartitionPaths during create_plan() does not need any path
copying and we expand only the PartitionPaths which will be converted
to plans. That does save a lot of memory; the reason why we defer
creating paths for child-joins.

This is not really a full review, mostly because I can't easily figure
out the motivation for all of the changes the patch makes. It makes a
lot of changes in a lot of places, and it's not really very easy to
understand why those changes are necessary. My comments above about
splitting the patch into a series of patches that can potentially be
reviewed and applied independently, with the main patch being the last
in the series, are a suggestion as to how to tackle that. There might
be some work that needs to or could be done on the comments, too. For
example, the patch splits out add_paths_to_append_rel from
set_append_rel_pathlist, but the comments don't say anything helpful
like "we need to do X after Y, because Z". They just say that we do
it. To some extent I think the comments in the optimizer have that
problem generally, so it's not entirely the fault of this patch;
still, the lack of those explanations makes the code reorganization
harder to follow, and might confuse future patch authors, too.

Specifically about add_paths_to_append_rel(), what do you expect the
comment to say? It would be obvious why we split that functionality
into a separate function: in fact, we don't necessarily explain why
certain code resides in a separate function in the comments. I think,
that particular comment (or for that matter other such comments in the
optimizer) can be removed altogether, since it just writes the
function names as an "English" sentence. I sometimes find those
comments useful, because I can read just those comments and forget
about the code, making comprehension easy. If highlighting is ON, your
brain habitually ignores the non-comment portions when required. I am
open to suggestions.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#30)
11 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Per your suggestion I have split the patch into many smaller patches.

0001-Refactor-set_append_rel_pathlist.patch
0002-Refactor-make_join_rel.patch
0003-Refactor-adjust_appendrel_attrs.patch
0004-Refactor-build_join_rel.patch
0005-Add-function-find_param_path_info.patch

These four refactor existing code.

0006-Canonical-partition-scheme.patch
0007-Partition-wise-join-tests.patch -- just tests, they fail
0008-Partition-wise-join.patch -- actual patch implementing
partition-wise join, still some tests fail\

0009-Adjust-join-related-to-code-to-accept-child-relation.patch
0010-Parameterized-path-fixes.patch
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch

The last three patches change existing code to expect child(-join)
relations where they were not expected earlier.

Each patch has summary of the changes.

Partition-wise join for multi-level partitioned tables is not covered
by these patches. I will post those patches soon.

Other questions/comments:

Why does find_partition_scheme need to copy the partition bound
information instead of just pointing to it? Amit went to some trouble
to make sure that this can't change under us while we hold a lock on
the relation, and we'd better hold a lock on the relation if we're
planning a query against it.

PartitionScheme is shared across multiple relations, join or base,
partitioned similarly. Obviously it can't and does not need to point
partition bound informations (which should all be same) of all those
base relations. O the the face of it, it looks weird that it points to
only one of them, mostly the one which it encounters first. But, since
it's going to be the same partition bound information, it doesn't
matter which one. So, I think, we can point of any one of those. Do
you agree?

Instead of copying PartitionBoundInfo, used pointer of the first
encountered one.

I think the PartitionScheme stuff should live in the optimizer rather
that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
eventually need a new file in the optimizer just for partitioning
stuff, but I'm not sure about that yet.

I placed PartitionScheme stuff in partition.c because most of the
functions and structures in partition.c are not visible outside that
file. But I will try again to locate PartitionScheme to optimizer.

Moved the code as per your suggestion.

The fact that set_append_rel_size needs to reopen the relation to
extract a few more bits of information is not desirable. You need to
fish this information through in some other way; for example, you
could have get_relation_info() stash the needed bits in the
RelOptInfo.

I considered this option and discarded it, since not all partitioned
relations will have OIDs for partitions e.g. partitioned joins will
not have OIDs for their partitions. But now that I think of it, we
should probably store those OIDs just for the base relation and leave
them unused for non-base relations just like other base relation
specific fields in RelOptInfo.

Changed as per your suggestions.

+                * For two partitioned tables with the same
partitioning scheme, it is
+                * assumed that the Oids of matching partitions from
both the tables
+                * are placed at the same position in the array of
partition oids in

Rather than saying that we assume this, you should say why it has to
be true. (If it doesn't have to be true, we shouldn't assume it.)

Will take care of this.

Done. Please check.

+                * join relations. Partition tables should have same
layout as the
+                * parent table and hence should not need any
translation. But rest of

The same attributes have to be present with the same types, but they
can be rearranged. This comment seems to imply the contrary.

Hmm, will take care of this.

Done.

FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+1. Will take care of this. Does "representative_partitions_fraction"
or "sample_partition_fraction" look like a good GUC name? Any other
suggestions?

used "sample_partition_fraction" for now. Suggestions are welcome.

+               /*
+                * Add this relation to the list of samples ordered by
the increasing
+                * number of rows at appropriate place.
+                */
+               foreach (lc, ordered_child_nos)
+               {
+                       int     child_no = lfirst_int(lc);
+                       RelOptInfo *other_childrel = rel->part_rels[child_no];
+
+                       /*
+                        * Keep track of child with lowest number of
rows but higher than the
+                        * that of the child being inserted. Insert
the child before a
+                        * child with highest number of rows lesser than it.
+                        */
+                       if (child_rel->rows <= other_childrel->rows)
+                               insert_after = lc;
+                       else
+                               break;
+               }

Can we use quicksort instead of a hand-coded insertion sort?

I guess so, if I write comparison functions, which shouldn't be a
problem. Will try that.

Done.

+ if (bms_num_members(outer_relids) > 1)

Seems like bms_get_singleton_member could be used.

That code is not required any more.

+ * Partitioning scheme in join relation indicates a possibilty that the

Spelling.

Done.

There seems to be no reason for create_partition_plan to be separated
from create_plan_recurse. You can just add another case for the new
path type.

Done.

Why does create_partition_join_path need to be separate from
create_partition_join_path_with_pathkeys? Couldn't that be combined
into a single function with a pathkeys argument that might sometimes
be NIL? I assume most of the logic is common.

Combined those into a single function.

From a sort of theoretical standpoint, the biggest danger of this
patch seems to be that by deferring path creation until a later stage
than normal, we could miss some important processing.
subquery_planner() does a lot of stuff after
expand_inherited_tables(); if any of those things, especially the ones
that happen AFTER path generation, have an effect on the paths, then
this code needs to compensate for those changes somehow. It seems
like having the planning of unsampled children get deferred until
create_plan() time is awfully surprising; here we are creating the
plan and suddenly what used to be a straightforward path->plan
translation is running around doing major planning work. I can't
entirely justify it, but I somehow have a feeling that work ought to
be moved earlier. Not sure exactly where.

Pasting my previous replies here to keep everything in one mail.

I agree with this. Probably we should add a path tree mutator before
SS_identify_outer_params() to replace any Partition*Paths with
Merge/Append paths. The mutator will create paths for child-joins
within temporary memory context, copy the relevant paths and create
Merge/Append paths. There are two problems there 1. We have to write
code to copy paths; most of the paths would be flat copy but custom
scan paths might have some unexpected problems. 2. There will be many
surviving PartitionPaths, and all the corresponding child paths would
need copying and consume memory. In order to reduce that consumption,
we have run this mutator after set_cheapest() in subquery_planner();
but then nothing interesting happens between that and create_plan().
Expanding PartitionPaths during create_plan() does not need any path
copying and we expand only the PartitionPaths which will be converted
to plans. That does save a lot of memory; the reason why we defer
creating paths for child-joins.

This is not really a full review, mostly because I can't easily figure
out the motivation for all of the changes the patch makes. It makes a
lot of changes in a lot of places, and it's not really very easy to
understand why those changes are necessary. My comments above about
splitting the patch into a series of patches that can potentially be
reviewed and applied independently, with the main patch being the last
in the series, are a suggestion as to how to tackle that. There might
be some work that needs to or could be done on the comments, too. For
example, the patch splits out add_paths_to_append_rel from
set_append_rel_pathlist, but the comments don't say anything helpful
like "we need to do X after Y, because Z". They just say that we do
it. To some extent I think the comments in the optimizer have that
problem generally, so it's not entirely the fault of this patch;
still, the lack of those explanations makes the code reorganization
harder to follow, and might confuse future patch authors, too.

Specifically about add_paths_to_append_rel(), what do you expect the
comment to say? It would be obvious why we split that functionality
into a separate function: in fact, we don't necessarily explain why
certain code resides in a separate function in the comments. I think,
that particular comment (or for that matter other such comments in the
optimizer) can be removed altogether, since it just writes the
function names as an "English" sentence. I sometimes find those
comments useful, because I can read just those comments and forget
about the code, making comprehension easy. If highlighting is ON, your
brain habitually ignores the non-comment portions when required. I am
open to suggestions.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Refactor-set_append_rel_pathlist.patchapplication/octet-stream; name=0001-Refactor-set_append_rel_pathlist.patch
0002-Refactor-make_join_rel.patchapplication/octet-stream; name=0002-Refactor-make_join_rel.patch
0003-Refactor-adjust_appendrel_attrs.patchapplication/octet-stream; name=0003-Refactor-adjust_appendrel_attrs.patch
0004-Refactor-build_join_rel.patchapplication/octet-stream; name=0004-Refactor-build_join_rel.patch
0005-Add-function-find_param_path_info.patchapplication/octet-stream; name=0005-Add-function-find_param_path_info.patch
0006-Canonical-partition-scheme.patchapplication/octet-stream; name=0006-Canonical-partition-scheme.patch
0007-Partition-wise-join-tests.patchapplication/octet-stream; name=0007-Partition-wise-join-tests.patch
0008-Partition-wise-join.patchapplication/octet-stream; name=0008-Partition-wise-join.patch
0009-Adjust-join-related-to-code-to-accept-child-relation.patchapplication/octet-stream; name=0009-Adjust-join-related-to-code-to-accept-child-relation.patch
0010-Parameterized-path-fixes.patchapplication/octet-stream; name=0010-Parameterized-path-fixes.patch
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patchapplication/octet-stream; name=0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
#33Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#32)
11 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Fixed a problem with the way qsort was being used in the earlier set
of patches. Attached PFA the set of patches with that fixed.

On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Per your suggestion I have split the patch into many smaller patches.

0001-Refactor-set_append_rel_pathlist.patch
0002-Refactor-make_join_rel.patch
0003-Refactor-adjust_appendrel_attrs.patch
0004-Refactor-build_join_rel.patch
0005-Add-function-find_param_path_info.patch

These four refactor existing code.

0006-Canonical-partition-scheme.patch
0007-Partition-wise-join-tests.patch -- just tests, they fail
0008-Partition-wise-join.patch -- actual patch implementing
partition-wise join, still some tests fail\

0009-Adjust-join-related-to-code-to-accept-child-relation.patch
0010-Parameterized-path-fixes.patch
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch

The last three patches change existing code to expect child(-join)
relations where they were not expected earlier.

Each patch has summary of the changes.

Partition-wise join for multi-level partitioned tables is not covered
by these patches. I will post those patches soon.

Other questions/comments:

Why does find_partition_scheme need to copy the partition bound
information instead of just pointing to it? Amit went to some trouble
to make sure that this can't change under us while we hold a lock on
the relation, and we'd better hold a lock on the relation if we're
planning a query against it.

PartitionScheme is shared across multiple relations, join or base,
partitioned similarly. Obviously it can't and does not need to point
partition bound informations (which should all be same) of all those
base relations. O the the face of it, it looks weird that it points to
only one of them, mostly the one which it encounters first. But, since
it's going to be the same partition bound information, it doesn't
matter which one. So, I think, we can point of any one of those. Do
you agree?

Instead of copying PartitionBoundInfo, used pointer of the first
encountered one.

I think the PartitionScheme stuff should live in the optimizer rather
that src/backend/catalog/partition.c. Maybe plancat.c? Perhaps we
eventually need a new file in the optimizer just for partitioning
stuff, but I'm not sure about that yet.

I placed PartitionScheme stuff in partition.c because most of the
functions and structures in partition.c are not visible outside that
file. But I will try again to locate PartitionScheme to optimizer.

Moved the code as per your suggestion.

The fact that set_append_rel_size needs to reopen the relation to
extract a few more bits of information is not desirable. You need to
fish this information through in some other way; for example, you
could have get_relation_info() stash the needed bits in the
RelOptInfo.

I considered this option and discarded it, since not all partitioned
relations will have OIDs for partitions e.g. partitioned joins will
not have OIDs for their partitions. But now that I think of it, we
should probably store those OIDs just for the base relation and leave
them unused for non-base relations just like other base relation
specific fields in RelOptInfo.

Changed as per your suggestions.

+                * For two partitioned tables with the same
partitioning scheme, it is
+                * assumed that the Oids of matching partitions from
both the tables
+                * are placed at the same position in the array of
partition oids in

Rather than saying that we assume this, you should say why it has to
be true. (If it doesn't have to be true, we shouldn't assume it.)

Will take care of this.

Done. Please check.

+                * join relations. Partition tables should have same
layout as the
+                * parent table and hence should not need any
translation. But rest of

The same attributes have to be present with the same types, but they
can be rearranged. This comment seems to imply the contrary.

Hmm, will take care of this.

Done.

FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+1. Will take care of this. Does "representative_partitions_fraction"
or "sample_partition_fraction" look like a good GUC name? Any other
suggestions?

used "sample_partition_fraction" for now. Suggestions are welcome.

+               /*
+                * Add this relation to the list of samples ordered by
the increasing
+                * number of rows at appropriate place.
+                */
+               foreach (lc, ordered_child_nos)
+               {
+                       int     child_no = lfirst_int(lc);
+                       RelOptInfo *other_childrel = rel->part_rels[child_no];
+
+                       /*
+                        * Keep track of child with lowest number of
rows but higher than the
+                        * that of the child being inserted. Insert
the child before a
+                        * child with highest number of rows lesser than it.
+                        */
+                       if (child_rel->rows <= other_childrel->rows)
+                               insert_after = lc;
+                       else
+                               break;
+               }

Can we use quicksort instead of a hand-coded insertion sort?

I guess so, if I write comparison functions, which shouldn't be a
problem. Will try that.

Done.

+ if (bms_num_members(outer_relids) > 1)

Seems like bms_get_singleton_member could be used.

That code is not required any more.

+ * Partitioning scheme in join relation indicates a possibilty that the

Spelling.

Done.

There seems to be no reason for create_partition_plan to be separated
from create_plan_recurse. You can just add another case for the new
path type.

Done.

Why does create_partition_join_path need to be separate from
create_partition_join_path_with_pathkeys? Couldn't that be combined
into a single function with a pathkeys argument that might sometimes
be NIL? I assume most of the logic is common.

Combined those into a single function.

From a sort of theoretical standpoint, the biggest danger of this
patch seems to be that by deferring path creation until a later stage
than normal, we could miss some important processing.
subquery_planner() does a lot of stuff after
expand_inherited_tables(); if any of those things, especially the ones
that happen AFTER path generation, have an effect on the paths, then
this code needs to compensate for those changes somehow. It seems
like having the planning of unsampled children get deferred until
create_plan() time is awfully surprising; here we are creating the
plan and suddenly what used to be a straightforward path->plan
translation is running around doing major planning work. I can't
entirely justify it, but I somehow have a feeling that work ought to
be moved earlier. Not sure exactly where.

Pasting my previous replies here to keep everything in one mail.

I agree with this. Probably we should add a path tree mutator before
SS_identify_outer_params() to replace any Partition*Paths with
Merge/Append paths. The mutator will create paths for child-joins
within temporary memory context, copy the relevant paths and create
Merge/Append paths. There are two problems there 1. We have to write
code to copy paths; most of the paths would be flat copy but custom
scan paths might have some unexpected problems. 2. There will be many
surviving PartitionPaths, and all the corresponding child paths would
need copying and consume memory. In order to reduce that consumption,
we have run this mutator after set_cheapest() in subquery_planner();
but then nothing interesting happens between that and create_plan().
Expanding PartitionPaths during create_plan() does not need any path
copying and we expand only the PartitionPaths which will be converted
to plans. That does save a lot of memory; the reason why we defer
creating paths for child-joins.

This is not really a full review, mostly because I can't easily figure
out the motivation for all of the changes the patch makes. It makes a
lot of changes in a lot of places, and it's not really very easy to
understand why those changes are necessary. My comments above about
splitting the patch into a series of patches that can potentially be
reviewed and applied independently, with the main patch being the last
in the series, are a suggestion as to how to tackle that. There might
be some work that needs to or could be done on the comments, too. For
example, the patch splits out add_paths_to_append_rel from
set_append_rel_pathlist, but the comments don't say anything helpful
like "we need to do X after Y, because Z". They just say that we do
it. To some extent I think the comments in the optimizer have that
problem generally, so it's not entirely the fault of this patch;
still, the lack of those explanations makes the code reorganization
harder to follow, and might confuse future patch authors, too.

Specifically about add_paths_to_append_rel(), what do you expect the
comment to say? It would be obvious why we split that functionality
into a separate function: in fact, we don't necessarily explain why
certain code resides in a separate function in the comments. I think,
that particular comment (or for that matter other such comments in the
optimizer) can be removed altogether, since it just writes the
function names as an "English" sentence. I sometimes find those
comments useful, because I can read just those comments and forget
about the code, making comprehension easy. If highlighting is ON, your
brain habitually ignores the non-comment portions when required. I am
open to suggestions.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Refactor-set_append_rel_pathlist.patchapplication/octet-stream; name=0001-Refactor-set_append_rel_pathlist.patch
0002-Refactor-make_join_rel.patchapplication/octet-stream; name=0002-Refactor-make_join_rel.patch
0003-Refactor-adjust_appendrel_attrs.patchapplication/octet-stream; name=0003-Refactor-adjust_appendrel_attrs.patch
0004-Refactor-build_join_rel.patchapplication/octet-stream; name=0004-Refactor-build_join_rel.patch
0005-Add-function-find_param_path_info.patchapplication/octet-stream; name=0005-Add-function-find_param_path_info.patch
0006-Canonical-partition-scheme.patchapplication/octet-stream; name=0006-Canonical-partition-scheme.patch
0007-Partition-wise-join-tests.patchapplication/octet-stream; name=0007-Partition-wise-join-tests.patch
0008-Partition-wise-join.patchapplication/octet-stream; name=0008-Partition-wise-join.patch
0009-Adjust-join-related-to-code-to-accept-child-relation.patchapplication/octet-stream; name=0009-Adjust-join-related-to-code-to-accept-child-relation.patch
0010-Parameterized-path-fixes.patchapplication/octet-stream; name=0010-Parameterized-path-fixes.patch
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patchapplication/octet-stream; name=0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
#34Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#33)
14 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Here is set of patches with support for partition-wise join between
multi-level partitioned tables.

On Fri, Feb 10, 2017 at 11:19 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Fixed a problem with the way qsort was being used in the earlier set
of patches. Attached PFA the set of patches with that fixed.

This fix is included.

On Thu, Feb 9, 2017 at 4:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Per your suggestion I have split the patch into many smaller patches.

0001-Refactor-set_append_rel_pathlist.patch
0002-Refactor-make_join_rel.patch
0003-Refactor-adjust_appendrel_attrs.patch
0004-Refactor-build_join_rel.patch
0005-Add-function-find_param_path_info.patch

These four refactor existing code.

0006-Canonical-partition-scheme.patch
0007-Partition-wise-join-tests.patch -- just tests, they fail
0008-Partition-wise-join.patch -- actual patch implementing
partition-wise join, still some tests fail\

0009-Adjust-join-related-to-code-to-accept-child-relation.patch
0010-Parameterized-path-fixes.patch
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch

patch to translate partition hierarchy into inheritance hierarchy
without flattening

0012-Multi-level-partitioned-table-expansion.patch

patches for multi-level partition-wise join support

0013-Multi-level-partition-wise-join-tests.patch
0014-Multi-level-partition-wise-join-support.patch

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Refactor-set_append_rel_pathlist.patchapplication/octet-stream; name=0001-Refactor-set_append_rel_pathlist.patch
0002-Refactor-make_join_rel.patchapplication/octet-stream; name=0002-Refactor-make_join_rel.patch
0003-Refactor-adjust_appendrel_attrs.patchapplication/octet-stream; name=0003-Refactor-adjust_appendrel_attrs.patch
0004-Refactor-build_join_rel.patchapplication/octet-stream; name=0004-Refactor-build_join_rel.patch
0005-Add-function-find_param_path_info.patchapplication/octet-stream; name=0005-Add-function-find_param_path_info.patch
0006-Canonical-partition-scheme.patchapplication/octet-stream; name=0006-Canonical-partition-scheme.patch
0007-Partition-wise-join-tests.patchapplication/octet-stream; name=0007-Partition-wise-join-tests.patch
0008-Partition-wise-join-implementation.patchapplication/octet-stream; name=0008-Partition-wise-join-implementation.patch
0009-Adjust-join-related-to-code-to-accept-child-relation.patchapplication/octet-stream; name=0009-Adjust-join-related-to-code-to-accept-child-relation.patch
0010-Parameterized-path-fixes.patchapplication/octet-stream; name=0010-Parameterized-path-fixes.patch
0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patchapplication/octet-stream; name=0011-Use-IS_JOIN_REL-instead-of-RELOPT_JOINREL.patch
0012-Multi-level-partitioned-table-expansion.patchapplication/octet-stream; name=0012-Multi-level-partitioned-table-expansion.patch
0013-Multi-level-partition-wise-join-tests.patchapplication/octet-stream; name=0013-Multi-level-partition-wise-join-tests.patch
0014-Multi-level-partition-wise-join-support.patchapplication/octet-stream; name=0014-Multi-level-partition-wise-join-support.patch
#35Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#30)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Feb 6, 2017 at 3:34 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

PartitionScheme is shared across multiple relations, join or base,
partitioned similarly. Obviously it can't and does not need to point
partition bound informations (which should all be same) of all those
base relations. O the the face of it, it looks weird that it points to
only one of them, mostly the one which it encounters first. But, since
it's going to be the same partition bound information, it doesn't
matter which one. So, I think, we can point of any one of those. Do
you agree?

Yes.

The fact that set_append_rel_size needs to reopen the relation to
extract a few more bits of information is not desirable. You need to
fish this information through in some other way; for example, you
could have get_relation_info() stash the needed bits in the
RelOptInfo.

I considered this option and discarded it, since not all partitioned
relations will have OIDs for partitions e.g. partitioned joins will
not have OIDs for their partitions. But now that I think of it, we
should probably store those OIDs just for the base relation and leave
them unused for non-base relations just like other base relation
specific fields in RelOptInfo.

Right.

FRACTION_PARTS_TO_PLAN seems like it should be a GUC.

+1. Will take care of this. Does "representative_partitions_fraction"
or "sample_partition_fraction" look like a good GUC name? Any other
suggestions?

I like the second one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#26)
5 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Robert and I discussed this in more detail. Path creation code may
allocate objects other than paths. postgres_fdw, for example,
allocates character array to hold the name of relation being
pushed-down. When the temporary context gets zapped after creating
paths for a given child-join, those other objects also gets thrown
away. Attached patch has implemented the idea that came out of the
discussion.

We create a memory context for holding paths at the time of creating
PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
macro makePathNode() which allocates the memory for given type of path
from this context. Every create_*_path function has been changed to
use this macro instead of makeNode(). In standard_planner(), at the
end of planning we destroy the memory context freeing all the paths
allocated. While creating a plan node, planner copies everything
required by the plan from the path, so the path is not needed any
more. So, freeing corresponding memory should not have any adverse
effects.

Most of the create_*_path() functions accept root as an argument, thus
the temporary path context is available through root->glob everywhere.
An exception is create_append_path() which does not accept root as an
argument. The patch changes create_append_path() and its callers like
set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an
argument. Ideally paths are not required after creating plan, so we
should be
able to free the context right after the call to create_plan(). But we
need dummy paths while creating flat rtable in
set_plan_references()->add_rtes_to_flat_rtable(). We used to So free
the path context at the end of planning cycle. Now that we are
allocating all the paths in a different memory context, it doesn't
make sense to switch context in mark_dummy_rel().

0001 patch implements the idea described above.
0002 patch adds instrumentation to measure memory consumed in
standard_planner() call.
0003 patch adds a GUC zap_paths to enable/disable destroying path context.
The last two patches are for testing only.

Attached also find the SQL script and its output showing the memory
saved. For a 5 way self-join of pg_class, the total memory consumed in
standard_planner() is 760K without patch and with patch it comes down
to 713K, saving 47K memory otherwise occupied by paths. It looks like
something useful even without partition-wise joins.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Free-up-memory-consumed-by-the-paths.patchapplication/octet-stream; name=0001-Free-up-memory-consumed-by-the-paths.patch
0002-Patch-to-measure-memory-used-in-CurrentMemoryContext.patchapplication/octet-stream; name=0002-Patch-to-measure-memory-used-in-CurrentMemoryContext.patch
0003-GUC-zap_path-to-enable-freeing-memory-consumed-by-pa.patchapplication/octet-stream; name=0003-GUC-zap_path-to-enable-freeing-memory-consumed-by-pa.patch
path_memory.outapplication/octet-stream; name=path_memory.out
path_memory.sqlapplication/octet-stream; name=path_memory.sql
#37Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#36)
3 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Updated 0001 patch with some more comments. Attaching all the patches
for quick access.

On Wed, Mar 1, 2017 at 2:26 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Robert and I discussed this in more detail. Path creation code may
allocate objects other than paths. postgres_fdw, for example,
allocates character array to hold the name of relation being
pushed-down. When the temporary context gets zapped after creating
paths for a given child-join, those other objects also gets thrown
away. Attached patch has implemented the idea that came out of the
discussion.

We create a memory context for holding paths at the time of creating
PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
macro makePathNode() which allocates the memory for given type of path
from this context. Every create_*_path function has been changed to
use this macro instead of makeNode(). In standard_planner(), at the
end of planning we destroy the memory context freeing all the paths
allocated. While creating a plan node, planner copies everything
required by the plan from the path, so the path is not needed any
more. So, freeing corresponding memory should not have any adverse
effects.

Most of the create_*_path() functions accept root as an argument, thus
the temporary path context is available through root->glob everywhere.
An exception is create_append_path() which does not accept root as an
argument. The patch changes create_append_path() and its callers like
set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an
argument. Ideally paths are not required after creating plan, so we
should be
able to free the context right after the call to create_plan(). But we
need dummy paths while creating flat rtable in
set_plan_references()->add_rtes_to_flat_rtable(). We used to So free
the path context at the end of planning cycle. Now that we are
allocating all the paths in a different memory context, it doesn't
make sense to switch context in mark_dummy_rel().

0001 patch implements the idea described above.
0002 patch adds instrumentation to measure memory consumed in
standard_planner() call.
0003 patch adds a GUC zap_paths to enable/disable destroying path context.
The last two patches are for testing only.

Attached also find the SQL script and its output showing the memory
saved. For a 5 way self-join of pg_class, the total memory consumed in
standard_planner() is 760K without patch and with patch it comes down
to 713K, saving 47K memory otherwise occupied by paths. It looks like
something useful even without partition-wise joins.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Free-up-memory-consumed-by-the-paths.patchapplication/octet-stream; name=0001-Free-up-memory-consumed-by-the-paths.patch
0002-Patch-to-measure-memory-used-in-CurrentMemoryContext.patchapplication/octet-stream; name=0002-Patch-to-measure-memory-used-in-CurrentMemoryContext.patch
0003-GUC-zap_path-to-enable-freeing-memory-consumed-by-pa.patchapplication/octet-stream; name=0003-GUC-zap_path-to-enable-freeing-memory-consumed-by-pa.patch
#38Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#36)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 1, 2017 at 3:56 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Robert and I discussed this in more detail. Path creation code may
allocate objects other than paths. postgres_fdw, for example,
allocates character array to hold the name of relation being
pushed-down. When the temporary context gets zapped after creating
paths for a given child-join, those other objects also gets thrown
away. Attached patch has implemented the idea that came out of the
discussion.

We create a memory context for holding paths at the time of creating
PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
macro makePathNode() which allocates the memory for given type of path
from this context. Every create_*_path function has been changed to
use this macro instead of makeNode(). In standard_planner(), at the
end of planning we destroy the memory context freeing all the paths
allocated. While creating a plan node, planner copies everything
required by the plan from the path, so the path is not needed any
more. So, freeing corresponding memory should not have any adverse
effects.

Most of the create_*_path() functions accept root as an argument, thus
the temporary path context is available through root->glob everywhere.
An exception is create_append_path() which does not accept root as an
argument. The patch changes create_append_path() and its callers like
set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an
argument. Ideally paths are not required after creating plan, so we
should be
able to free the context right after the call to create_plan(). But we
need dummy paths while creating flat rtable in
set_plan_references()->add_rtes_to_flat_rtable(). We used to So free
the path context at the end of planning cycle. Now that we are
allocating all the paths in a different memory context, it doesn't
make sense to switch context in mark_dummy_rel().

0001 patch implements the idea described above.
0002 patch adds instrumentation to measure memory consumed in
standard_planner() call.
0003 patch adds a GUC zap_paths to enable/disable destroying path context.
The last two patches are for testing only.

Attached also find the SQL script and its output showing the memory
saved. For a 5 way self-join of pg_class, the total memory consumed in
standard_planner() is 760K without patch and with patch it comes down
to 713K, saving 47K memory otherwise occupied by paths. It looks like
something useful even without partition-wise joins.

Of course, that's not a lot, but the savings will be a lot better for
partition-wise joins. Do you have a set of patches for that feature
that apply on top of 0001?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#38)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

PFA the zip containing all the patches rebased on
56018bf26eec1a0b4bf20303c98065a8eb1b0c5d and contain the patch to free
memory consumed by paths using a separate path context.

There are some more changes wrt earlier set of patches
1. Since we don't need a separate context for planning for each
child_join, changed code in create_partition_join_plan() to not do
that. The function collects all child_join paths into merge/append
path and calls create_plan_recurse() on that path instead of
converting each child_join path to plan one at a time.

2. Changed optimizer/README and some comments referring to temporary
memory context, since we do not use that anymore.

3. reparameterize_path_by_child() is fixed to translate the merge and
hash clause in Hash/Merge path.

On Thu, Mar 9, 2017 at 6:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 1, 2017 at 3:56 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Robert and I discussed this in more detail. Path creation code may
allocate objects other than paths. postgres_fdw, for example,
allocates character array to hold the name of relation being
pushed-down. When the temporary context gets zapped after creating
paths for a given child-join, those other objects also gets thrown
away. Attached patch has implemented the idea that came out of the
discussion.

We create a memory context for holding paths at the time of creating
PlannerGlobal and save it in PlannerGlobal. The patch introduces a new
macro makePathNode() which allocates the memory for given type of path
from this context. Every create_*_path function has been changed to
use this macro instead of makeNode(). In standard_planner(), at the
end of planning we destroy the memory context freeing all the paths
allocated. While creating a plan node, planner copies everything
required by the plan from the path, so the path is not needed any
more. So, freeing corresponding memory should not have any adverse
effects.

Most of the create_*_path() functions accept root as an argument, thus
the temporary path context is available through root->glob everywhere.
An exception is create_append_path() which does not accept root as an
argument. The patch changes create_append_path() and its callers like
set_dummy_rel_pathlist(), mark_dummy_rel() to accept root as an
argument. Ideally paths are not required after creating plan, so we
should be
able to free the context right after the call to create_plan(). But we
need dummy paths while creating flat rtable in
set_plan_references()->add_rtes_to_flat_rtable(). We used to So free
the path context at the end of planning cycle. Now that we are
allocating all the paths in a different memory context, it doesn't
make sense to switch context in mark_dummy_rel().

0001 patch implements the idea described above.
0002 patch adds instrumentation to measure memory consumed in
standard_planner() call.
0003 patch adds a GUC zap_paths to enable/disable destroying path context.
The last two patches are for testing only.

Attached also find the SQL script and its output showing the memory
saved. For a 5 way self-join of pg_class, the total memory consumed in
standard_planner() is 760K without patch and with patch it comes down
to 713K, saving 47K memory otherwise occupied by paths. It looks like
something useful even without partition-wise joins.

Of course, that's not a lot, but the savings will be a lot better for
partition-wise joins. Do you have a set of patches for that feature
that apply on top of 0001?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v2.zipapplication/zip; name=pg_dp_join_patches_v2.zip
#40Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#39)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Mar 10, 2017 at 5:43 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

PFA the zip containing all the patches rebased on
56018bf26eec1a0b4bf20303c98065a8eb1b0c5d and contain the patch to free
memory consumed by paths using a separate path context.

Some very high-level thoughts based on a look through these patches:

In 0001, you've removed a comment about how GEQO needs special
handling, but it doesn't look as if you've made any compensating
change elsewhere. That seems unlikely to be correct. If GEQO needs
some paths to survive longer than others, how can it be right for this
code to create them all in the same context? Incidentally,
geqo_eval() seems to be an existing precedent for the idea of throwing
away paths and RelOptInfos, so we might want to use similar code for
partitionwise join.

0002 and 0003 look OK.

Probably 0004 is OK too, although that seems to be adding some
overhead to existing callers for the benefit of new ones. Might be
insignificant, though.

0005 looks OK, except that add_join_rel's definition is missing a
"static" qualifier. That's not just cosmetic; based on previous
expereince, this will break the BF.

0006 seems to be unnecessary; the new function isn't used in later patches.

Haven't looked at 0007 yet.

0008 is, as previously mentioned, more than we probably want to commit.

Haven't looked at 0009 yet.

0010 - 0012 seem to be various fixes which would need to be done
before or along with 0009, rather than afterward, so I am confused
about the ordering of those patches in the patch series.

The commit message for 0013 is a bit unclear about what it's doing,
although I can guess, a bit, based on the commit message for 0007.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#40)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Haven't looked at 0007 yet.

+               if (rel->part_scheme)
+               {
+                       int             cnt_parts;
+
+                       for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+                       {
+                               if (rel->part_oids[cnt_parts] ==
childRTE->relid)
+                               {
+                                       Assert(!rel->part_rels[cnt_parts]);
+                                       rel->part_rels[cnt_parts] = childrel;
+                               }
+                       }
+               }

It's not very appealing to use an O(n^2) algorithm here. I wonder if
we could arrange things so that inheritance expansion expands
partitions in the right order, and then we could just match them up
one-to-one. This would probably require an alternate version of
find_all_inheritors() that expand_inherited_rtentry() would call only
for partitioned tables. Failing that, another idea would be to use
qsort() or qsort_arg() to put the partitions in the right order.

+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !inhparent ||
+               !(rel->part_scheme = find_partition_scheme(root, relation)))

Maybe just don't call this function in the first place in the
!inhparent case, instead of passing down an argument that must always
be true.

+               /* Match the partition key types. */
+               for (cnt_pks = 0; cnt_pks < partnatts; cnt_pks++)
+               {
+                       /*
+                        * For types, it suffices to match the type
id, mod and collation;
+                        * len, byval and align are depedent on the first two.
+                        */
+                       if (part_key->partopfamily[cnt_pks] !=
part_scheme->partopfamily[cnt_pks] ||
+                               part_key->partopcintype[cnt_pks] !=
part_scheme->partopcintype[cnt_pks] ||
+                               part_key->parttypid[cnt_pks] !=
part_scheme->key_types[cnt_pks] ||
+                               part_key->parttypmod[cnt_pks] !=
part_scheme->key_typmods[cnt_pks] ||
+                               part_key->parttypcoll[cnt_pks] !=
part_scheme->key_collations[cnt_pks])
+                               break;
+               }

I think memcmp() might be better than a for-loop.

Overall this one looks pretty good and straightforward. Of course, I
haven't looked at the main act (0009) yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#41)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/03/14 9:17, Robert Haas wrote:

On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Haven't looked at 0007 yet.

Overall this one looks pretty good and straightforward.

In the following code of find_partition_scheme():

+	/* Did not find matching partition scheme. Create one. */
+	part_scheme = (PartitionScheme) palloc0(sizeof(PartitionSchemeData));
+
+	/* Copy partition bounds/lists. */
+	part_scheme->nparts = part_desc->nparts;
+	part_scheme->strategy = part_key->strategy;
+	part_scheme->boundinfo = part_desc->boundinfo;
+
+	/* Store partition key information. */
+	part_scheme->partnatts = part_key->partnatts;
+
+	part_scheme->partopfamily = (Oid *) palloc(sizeof(Oid) * partnatts);
+	memcpy(part_scheme->partopfamily, part_key->partopfamily,
+		   sizeof(Oid) * partnatts);
+
+	part_scheme->partopcintype = (Oid *) palloc(sizeof(Oid) * partnatts);
+	memcpy(part_scheme->partopcintype, part_key->partopcintype,
+		   sizeof(Oid) * partnatts);
+
+	part_scheme->key_types = (Oid *) palloc(sizeof(Oid) * partnatts);
+	memcpy(part_scheme->key_types, part_key->parttypid,
+		   sizeof(Oid) * partnatts);
+
+	part_scheme->key_typmods = (int32 *) palloc(sizeof(int32) * partnatts);
+	memcpy(part_scheme->key_typmods, part_key->parttypmod,
+		   sizeof(int32) * partnatts);
+
+	part_scheme->key_collations = (Oid *) palloc(sizeof(Oid) * partnatts);
+	memcpy(part_scheme->key_collations, part_key->parttypcoll,
+		   sizeof(Oid) * partnatts);

Couldn't we avoid the memcpy() on individual members of part_key? After
all, RelationData.rd_partkey is guarded just like rd_partdesc by
relcache.c in face of invalidations (see keep_partkey logic in
RelationClearRelation).

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#40)
Re: Partition-wise join for join between (declaratively) partitioned tables

Thanks for the review.

Some very high-level thoughts based on a look through these patches:

In 0001, you've removed a comment about how GEQO needs special
handling, but it doesn't look as if you've made any compensating
change elsewhere. That seems unlikely to be correct. If GEQO needs
some paths to survive longer than others, how can it be right for this
code to create them all in the same context?

Thanks for pointing that out. I have replaced the code and the
comments back. There was another issue that the temporary paths
created by geqo will not be freed when geqo moves one genetic string
to next genetic string (or what it calls as tour: an list of relation
to be joined in a given order). To fix this, we need to set the path
context to the temporary context of geqo inside geqo_eval() before
calling gimme_tree() and reset it later. That way the temporary paths
are also created in the temporary memory context of geqo. Fixed in the
patch.

Incidentally,
geqo_eval() seems to be an existing precedent for the idea of throwing
away paths and RelOptInfos, so we might want to use similar code for
partitionwise join.

There are some differences in what geqo does and what partition-wise
needs to do. geqo tries many joining orders each one in a separate
temporary context. The way geqo slices the work, every slice produces
a full plan. For partition-wise join I do not see a way to slice the
work such that the whole path and corresponding RelOptInfos come from
the same slice. So, we can't use the same method as GEQO.

It's worth noticing that paths are created twice for the cheapest
joining order that it finds: once in the trial phase and second time
when the final plan is created. The second time, the paths,
RelOptInfos, expressions used by the final plan are in the context as
the plan.

0002 and 0003 look OK.

Probably 0004 is OK too, although that seems to be adding some
overhead to existing callers for the benefit of new ones. Might be
insignificant, though.

Yes, the overhead is to add and extract the appinfo from a list when
there is only one appinfo. We may optimize it by passing appinfo
directly when there's only one and pass list when there are more, but
that is complicating the code unnecessarily. The overhead seems to be
worth the cost to keep the code simpler.

0005 looks OK, except that add_join_rel's definition is missing a
"static" qualifier. That's not just cosmetic; based on previous
expereince, this will break the BF.

Thanks for pointing it out. Done.

0006 seems to be unnecessary; the new function isn't used in later patches.

It's required by 0011 - reparameterize_path_by_child(). BTW, I need to
know whether reparameterize_path_by_child() looks good, so that I can
complete it by adding support for all kinds of path in that function.

Haven't looked at 0007 yet.

0008 is, as previously mentioned, more than we probably want to commit.

I agree, and I will work on that.

Haven't looked at 0009 yet.

0010 - 0012 seem to be various fixes which would need to be done
before or along with 0009, rather than afterward, so I am confused
about the ordering of those patches in the patch series.

They are needed only when we have 0009. But when those are clubbed
with 0009, it makes 0009 review difficult as the code for those fixes
mixes with the code for partition-wise support. So, I have separated
those out into patches categorized by functionality. Reviewer may then
apply 0009 and see what failures each of the changes in 0010-0012
fixes, if required. They need to be committed along-with 0009.

The commit message for 0013 is a bit unclear about what it's doing,
although I can guess, a bit, based on the commit message for 0007.

This is preparatory patch for 0015 which supports partition-wise join
for multi-level partitioned tables. We have discussed about
partition-wise join support for multi-level partitioned tables in [1]/messages/by-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV=kB8Akkqg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company.
We may decide to postpone patches 0013-0015 to v11, if this gets too
much for v10.

[1]: /messages/by-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV=kB8Akkqg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#41)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 14, 2017 at 5:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Haven't looked at 0007 yet.

+               if (rel->part_scheme)
+               {
+                       int             cnt_parts;
+
+                       for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+                       {
+                               if (rel->part_oids[cnt_parts] ==
childRTE->relid)
+                               {
+                                       Assert(!rel->part_rels[cnt_parts]);
+                                       rel->part_rels[cnt_parts] = childrel;
+                               }
+                       }
+               }

It's not very appealing to use an O(n^2) algorithm here. I wonder if
we could arrange things so that inheritance expansion expands
partitions in the right order, and then we could just match them up
one-to-one. This would probably require an alternate version of
find_all_inheritors() that expand_inherited_rtentry() would call only
for partitioned tables.

That seems a much better solution, but
1. Right now when we expand a multi-level partitioned table, we
include indirect partitions as direct children in inheritance
hierachy. part_rels array OTOH should correspond to the partitioning
scheme and should hold RelOptInfos of direct partitions. 0013 patch
fixes that to include only direct partitions as direct children
preserving partitioning hierarchy in the inheritance hierarchy. That
patch right now uses find_inheritance_children() to get Oids of direct
partitions, but instead it could return rd_partdesc->oids in the form
of list; OIDs ordered same as the array. Once we do that, we should
expect the appinfos to appear in the same order as the
rd_partdesc->oids and so RelOptInfo::part_oids. We just need to make
sure that the order is preserved and assign part_rels as they appear
in that loop.

One would argue that we preserve the OIDs only for single-level
partitioned tables, but in expand_inheritance_rtentry(), if we want to
detect whether a relation is single-level partitioned or multi-level,
we need to look up its direct partitions to see if they are further
partitioned. That will look a bit ugly and will not be necessary once
we have 0013. In case we decide to defer multi-level partitioned table
changes to v11 and based on the progress in [1]/messages/by-id/2b0d42f2-3a53-763b-c9c2-47139e4b1c2e@lab.ntt.co.jp -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company, I will work on fixing
the order in which appinfos are created for single-level partitioned
tables.

Failing that, another idea would be to use
qsort() or qsort_arg() to put the partitions in the right order.

I didn't get this. I could not find documentation for qsort_arg(). Can
you please elaborate? I guess, if we fix expand_inheritance_rtentry()
we don't need this. It looks like we will change
expand_inheritance_rtentry() anyway.

+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !inhparent ||
+               !(rel->part_scheme = find_partition_scheme(root, relation)))

Maybe just don't call this function in the first place in the
!inhparent case, instead of passing down an argument that must always
be true.

The function serves a single place to re/set partitioning information.
It would set the partitioning information if the above three
conditions are met. Otherwise it would nullify that information. If we
decide not to call this function when !inhparent, we will need to
nullify the partitioning information outside of this function as well
as inside this function, duplicating the code.

+               /* Match the partition key types. */
+               for (cnt_pks = 0; cnt_pks < partnatts; cnt_pks++)
+               {
+                       /*
+                        * For types, it suffices to match the type
id, mod and collation;
+                        * len, byval and align are depedent on the first two.
+                        */
+                       if (part_key->partopfamily[cnt_pks] !=
part_scheme->partopfamily[cnt_pks] ||
+                               part_key->partopcintype[cnt_pks] !=
part_scheme->partopcintype[cnt_pks] ||
+                               part_key->parttypid[cnt_pks] !=
part_scheme->key_types[cnt_pks] ||
+                               part_key->parttypmod[cnt_pks] !=
part_scheme->key_typmods[cnt_pks] ||
+                               part_key->parttypcoll[cnt_pks] !=
part_scheme->key_collations[cnt_pks])
+                               break;
+               }

I think memcmp() might be better than a for-loop.

Done.

PFA patches.

[1]: /messages/by-id/2b0d42f2-3a53-763b-c9c2-47139e4b1c2e@lab.ntt.co.jp -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v3.zipapplication/zip; name=pg_dp_join_patches_v3.zip
#45Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#42)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 14, 2017 at 6:28 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/03/14 9:17, Robert Haas wrote:

On Mon, Mar 13, 2017 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Haven't looked at 0007 yet.

Overall this one looks pretty good and straightforward.

In the following code of find_partition_scheme():

+       /* Did not find matching partition scheme. Create one. */
+       part_scheme = (PartitionScheme) palloc0(sizeof(PartitionSchemeData));
+
+       /* Copy partition bounds/lists. */
+       part_scheme->nparts = part_desc->nparts;
+       part_scheme->strategy = part_key->strategy;
+       part_scheme->boundinfo = part_desc->boundinfo;
+
+       /* Store partition key information. */
+       part_scheme->partnatts = part_key->partnatts;
+
+       part_scheme->partopfamily = (Oid *) palloc(sizeof(Oid) * partnatts);
+       memcpy(part_scheme->partopfamily, part_key->partopfamily,
+                  sizeof(Oid) * partnatts);
+
+       part_scheme->partopcintype = (Oid *) palloc(sizeof(Oid) * partnatts);
+       memcpy(part_scheme->partopcintype, part_key->partopcintype,
+                  sizeof(Oid) * partnatts);
+
+       part_scheme->key_types = (Oid *) palloc(sizeof(Oid) * partnatts);
+       memcpy(part_scheme->key_types, part_key->parttypid,
+                  sizeof(Oid) * partnatts);
+
+       part_scheme->key_typmods = (int32 *) palloc(sizeof(int32) * partnatts);
+       memcpy(part_scheme->key_typmods, part_key->parttypmod,
+                  sizeof(int32) * partnatts);
+
+       part_scheme->key_collations = (Oid *) palloc(sizeof(Oid) * partnatts);
+       memcpy(part_scheme->key_collations, part_key->parttypcoll,
+                  sizeof(Oid) * partnatts);

Couldn't we avoid the memcpy() on individual members of part_key? After
all, RelationData.rd_partkey is guarded just like rd_partdesc by
relcache.c in face of invalidations (see keep_partkey logic in
RelationClearRelation).

This suggestion looks good to me. Incorporated in the latest set of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#43)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 14, 2017 at 8:04 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

In 0001, you've removed a comment about how GEQO needs special
handling, but it doesn't look as if you've made any compensating
change elsewhere. That seems unlikely to be correct. If GEQO needs
some paths to survive longer than others, how can it be right for this
code to create them all in the same context?

Thanks for pointing that out. I have replaced the code and the
comments back. There was another issue that the temporary paths
created by geqo will not be freed when geqo moves one genetic string
to next genetic string (or what it calls as tour: an list of relation
to be joined in a given order). To fix this, we need to set the path
context to the temporary context of geqo inside geqo_eval() before
calling gimme_tree() and reset it later. That way the temporary paths
are also created in the temporary memory context of geqo. Fixed in the
patch.

Yeah, that looks better.

Incidentally,
geqo_eval() seems to be an existing precedent for the idea of throwing
away paths and RelOptInfos, so we might want to use similar code for
partitionwise join.

There are some differences in what geqo does and what partition-wise
needs to do. geqo tries many joining orders each one in a separate
temporary context. The way geqo slices the work, every slice produces
a full plan. For partition-wise join I do not see a way to slice the
work such that the whole path and corresponding RelOptInfos come from
the same slice. So, we can't use the same method as GEQO.

What I was thinking about was the use of this technique for getting
rid of joinrels:

root->join_rel_list = list_truncate(root->join_rel_list,
savelength);
root->join_rel_hash = savehash;

makePathNode() serves to segregate paths into a separate memory
context that can then be destroyed, but as you point out, the path
lists are still hanging around, and so are the RelOptInfo nodes. It
seems to me we could do a lot better using this technique. Suppose we
jigger things so that the List objects created by add_path go into
path_cxt, and so that RelOptInfo nodes also go into path_cxt. Then
when we blow up path_cxt we won't have dangling pointers in the
RelOptInfo objects any more because the RelOptInfos themselves will be
gone. The only problem is that the join_rel_list (and join_rel_hash
if it exists) will be corrupt, but we can fix that using the technique
demonstrated above.

Of course, that supposes that 0009 can manage to postpone creating
non-sampled child joinrels until create_partition_join_plan(), which
it currently doesn't. In fact, unless I'm missing something, 0009
hasn't been even slightly adapted to take advantage of the
infrastructure in 0001; it doesn't seem to reset the path_cxt or
anything. That seems like a fairly major omission.

Incidentally, I committed 0002, 0003, and 0005 as a single commit with
a few tweaks; I think you may need to do a bit of rebasing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#46)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Of course, that supposes that 0009 can manage to postpone creating
non-sampled child joinrels until create_partition_join_plan(), which
it currently doesn't. In fact, unless I'm missing something, 0009
hasn't been even slightly adapted to take advantage of the
infrastructure in 0001; it doesn't seem to reset the path_cxt or
anything. That seems like a fairly major omission.

Some other comments on 0009:

Documentation changes for the new GUCs are missing.

+between the partition keys of the joining tables. The equi-join between
+partition keys implies that for a given row in a given partition of a given
+partitioned table, its joining row, if exists, should exist only in the
+matching partition of the other partitioned table; no row from non-matching

There could be more than one. I'd write: The equi-join between
partition keys implies that all join partners for a given row in one
partitioned table must be in the corresponding partition of the other
partitioned table.

+#include "miscadmin.h"
#include <limits.h>
#include <math.h>

Added in wrong place.

+                                * System attributes do not need
translation. In such a case,
+                                * the attribute numbers of the parent
and the child should
+                                * start from the same minimum attribute.

I would delete the second sentence and add an Assert() to that effect instead.

+ /* Pass top parent's relids down the inheritance hierarchy. */

Why?

+ for (attno = rel->min_attr; attno <=
rel->max_attr; attno++)

Add add a comment explaining why we need to do this.

-       add_paths_to_append_rel(root, rel, live_childrels);
+       add_paths_to_append_rel(root, rel, live_childrels, false);
 }

-

No need to remove blank line.

+ * When called on partitioned join relation with partition_join_path = true, it
+ * adds PartitionJoinPath instead of Merge/Append path. This path is costed
+ * based on the costs of sampled child-join and is expanded later into
+ * Merge/Append plan.

I'm not a big fan of the Merge/Append terminology here. If somebody
adds another kind of append-path someday, then all of these comments
will have to be updated. I think this can be phrased more
generically.

        /*
+        * While creating PartitionJoinPath, we sample paths from only
a few child
+        * relations. Even if all of sampled children have partial
paths, it's not
+        * guaranteed that all the unsampled children will have partial paths.
+        * Hence we do not create partial PartitionJoinPaths.
+        */

Very sad. I guess if we had parallel append available, we could maybe
dodge this problem, but for now I suppose we're stuck with it.

+       /*
+        * Partitioning scheme in join relation indicates a possibility that the
+        * join may be partitioned, but it's not necessary that every pair of
+        * joining relations can use partition-wise join technique. If one of
+        * joining relations turns out to be unpartitioned, this pair of joining
+        * relations can not use partition-wise join technique.
+        */
+       if (!rel1->part_scheme || !rel2->part_scheme)
+               return;

How can this happen? If rel->part_scheme != NULL, doesn't that imply
that every rel covered by the joinrel is partitioned that way, and
therefore this condition must necessarily hold?

In general, I think it's better style to write explicit tests against
NULL or NIL than to just write if (blahptr).

+ partitioned_join->sjinfo = copyObject(parent_sjinfo);

Why do we need to copy it?

+       /*
+        * Remove the relabel decoration. We can assume that there is
at most one
+        * RelabelType node; eval_const_expressions() simplifies multiple
+        * RelabelType nodes into one.
+        */
+       if (IsA(expr, RelabelType))
+               expr = (Expr *) ((RelabelType *) expr)->arg;

Still, instead of assuming this, you could just s/if/while/, and then
you wouldn't need the assumption any more. Also, consider castNode().

partition_wise_plan_weight may be useful for testing, but I don't
think it should be present in the final patch.

This is not a full review; I ran out of mental energy before I got to
the end. (Sorry.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#46)
Re: Partition-wise join for join between (declaratively) partitioned tables

There are some differences in what geqo does and what partition-wise
needs to do. geqo tries many joining orders each one in a separate
temporary context. The way geqo slices the work, every slice produces
a full plan. For partition-wise join I do not see a way to slice the
work such that the whole path and corresponding RelOptInfos come from
the same slice. So, we can't use the same method as GEQO.

What I was thinking about was the use of this technique for getting
rid of joinrels:

root->join_rel_list = list_truncate(root->join_rel_list,
savelength);
root->join_rel_hash = savehash;

makePathNode() serves to segregate paths into a separate memory
context that can then be destroyed, but as you point out, the path
lists are still hanging around, and so are the RelOptInfo nodes. It
seems to me we could do a lot better using this technique. Suppose we
jigger things so that the List objects created by add_path go into
path_cxt, and so that RelOptInfo nodes also go into path_cxt. Then
when we blow up path_cxt we won't have dangling pointers in the
RelOptInfo objects any more because the RelOptInfos themselves will be
gone. The only problem is that the join_rel_list (and join_rel_hash
if it exists) will be corrupt, but we can fix that using the technique
demonstrated above.

Of course, that supposes that 0009 can manage to postpone creating
non-sampled child joinrels until create_partition_join_plan(), which
it currently doesn't.

Right. We need the child-join's RelOptInfos to estimate sizes, so that
we could sample the largest ones. So postponing it looks difficult.

In fact, unless I'm missing something, 0009
hasn't been even slightly adapted to take advantage of the
infrastructure in 0001; it doesn't seem to reset the path_cxt or
anything. That seems like a fairly major omission.

The path_cxt reset introduced by 0001 recycles memory used by all the
paths, including paths created for the children. But that happens only
after all the planning has completed. I thought that's what we
discussed to be done. We could create a separate path context for
every top-level child-join. That will require either copying the
cheapest path-tree into root->glob->path_cxt memory context OR will
require it to be converted to a plan immediately. The first will
require spending CPU cycles and memory in copying path-tree. The later
requires almost all the create_*_append_plan() code to be duplicated
in create_partition_join_plan() which is ugly. In an earlier version
of this patch I had that code, which I got rid of in the latest set of
patches. Between those two the first looks better.

Incidentally, I committed 0002, 0003, and 0005 as a single commit with
a few tweaks; I think you may need to do a bit of rebasing.

Thanks. I will have fewer patches to rebase now :).

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#49Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#47)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 15, 2017 at 6:51 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 14, 2017 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Of course, that supposes that 0009 can manage to postpone creating
non-sampled child joinrels until create_partition_join_plan(), which
it currently doesn't. In fact, unless I'm missing something, 0009
hasn't been even slightly adapted to take advantage of the
infrastructure in 0001; it doesn't seem to reset the path_cxt or
anything. That seems like a fairly major omission.

Some other comments on 0009:

Documentation changes for the new GUCs are missing.

Done. The description might need more massaging, but I will work on
that once we have fixed their names and usage. I think
sample_partition_fraction and partition_wise_plan_weight, if retained,
will be applicable to other partition-wise planning like
partition-wise aggregates. So we will need more generic description
there.

+between the partition keys of the joining tables. The equi-join between
+partition keys implies that for a given row in a given partition of a given
+partitioned table, its joining row, if exists, should exist only in the
+matching partition of the other partitioned table; no row from non-matching

There could be more than one. I'd write: The equi-join between
partition keys implies that all join partners for a given row in one
partitioned table must be in the corresponding partition of the other
partitioned table.

Done. I think it's important to emphasize the the joining partners can
not be in other partitions. So, added that sentence after your
suggested sentence.

+#include "miscadmin.h"
#include <limits.h>
#include <math.h>

Added in wrong place.

Done.

+                                * System attributes do not need
translation. In such a case,
+                                * the attribute numbers of the parent
and the child should
+                                * start from the same minimum attribute.

I would delete the second sentence and add an Assert() to that effect instead.

The assertion is there just few lines down. Please let me know if that
suffices. Deleted the second sentence.

+ /* Pass top parent's relids down the inheritance hierarchy. */

Why?

That is required for a multi-level partitioned table.
top_parent_relids are used for translating expressions of the top
parent to that of child table.

+ for (attno = rel->min_attr; attno <=
rel->max_attr; attno++)

Add add a comment explaining why we need to do this.

The comment is there just few lines above. I have moved it just above
this for loop.

-       add_paths_to_append_rel(root, rel, live_childrels);
+       add_paths_to_append_rel(root, rel, live_childrels, false);
}

-

No need to remove blank line.

Sorry. That was added by my patch to refactor
set_append_rel_pathlist(). I have added a patch in the series to
remove that line.

+ * When called on partitioned join relation with partition_join_path = true, it
+ * adds PartitionJoinPath instead of Merge/Append path. This path is costed
+ * based on the costs of sampled child-join and is expanded later into
+ * Merge/Append plan.

I'm not a big fan of the Merge/Append terminology here. If somebody
adds another kind of append-path someday, then all of these comments
will have to be updated. I think this can be phrased more
generically.

Reworded as
+ * When partition_join_path is true, the caller intends to add a
+ * PartitionJoinPath costed based on the sampled child-joins passed as
+ * live_childrels.

Also added an assertion to make sure the partition_join_path is true
only for join relations.

/*
+        * While creating PartitionJoinPath, we sample paths from only
a few child
+        * relations. Even if all of sampled children have partial
paths, it's not
+        * guaranteed that all the unsampled children will have partial paths.
+        * Hence we do not create partial PartitionJoinPaths.
+        */

Very sad. I guess if we had parallel append available, we could maybe
dodge this problem, but for now I suppose we're stuck with it.

Really sad. Is there a way to look at the relation (without any
partial paths yet) and see whether the relation will have partial
paths or not. Even if we don't have actual partial paths but know that
there will be at least one added in the future, we will be able to fix
this problem.

+       /*
+        * Partitioning scheme in join relation indicates a possibility that the
+        * join may be partitioned, but it's not necessary that every pair of
+        * joining relations can use partition-wise join technique. If one of
+        * joining relations turns out to be unpartitioned, this pair of joining
+        * relations can not use partition-wise join technique.
+        */
+       if (!rel1->part_scheme || !rel2->part_scheme)
+               return;

How can this happen? If rel->part_scheme != NULL, doesn't that imply
that every rel covered by the joinrel is partitioned that way, and
therefore this condition must necessarily hold?

I don't remember exactly, but this was added considering a more
generic partition-wise join. But then we would have more changes when
we support that. So, turned this into an assertion.

In general, I think it's better style to write explicit tests against
NULL or NIL than to just write if (blahptr).

PG code uses both the styles. Take for example
src/backend/rewrite/rewriteManip.c or any file, both styles are being
used. I find this style useful, when I want to code, say "if this
relation does not have a partitioning scheme" rather than "if this
relation have NULL partitioning scheme". Although I don't have
objections changing it as per your suggestion.

+ partitioned_join->sjinfo = copyObject(parent_sjinfo);

Why do we need to copy it?

sjinfo in make_join_rel() may be from root->join_info_list or it could
be one made up locally in that function. The one made up in that
function would go away with that function, whereas we need it much
later to create paths for child-joins. So, I thought it's better to
copy it. But now I have changed to code to pass NULL for a made-up
sjinfo. In such a case, the child-join's sjinfo is also made up. This
required some refactoring to separate out the making-up code. So,
there's new refactoring patch.

+       /*
+        * Remove the relabel decoration. We can assume that there is
at most one
+        * RelabelType node; eval_const_expressions() simplifies multiple
+        * RelabelType nodes into one.
+        */
+       if (IsA(expr, RelabelType))
+               expr = (Expr *) ((RelabelType *) expr)->arg;

Still, instead of assuming this, you could just s/if/while/, and then
you wouldn't need the assumption any more. Also, consider castNode().

Done.

partition_wise_plan_weight may be useful for testing, but I don't
think it should be present in the final patch.

partition_join test needs it so that it can work with smaller dataset
and complete faster. For smaller data sets the partition-wise join
paths come out to be costlier than other kinds and are never chosen.
By setting partition_wise_plan_weight I can force partition-wise join
to be chosen. An alternate solution would be to use
sample_partition_fraction = 1.0, but then we will never test delayed
planning for unsampled child-joins. I also think that users will find
partition_wise_plan_weight useful when estimates based on samples are
unrealistic. Obviously, in a longer run we should be able to provide
better estimates.

Apart from this, I have also removed recursive calls to
try_partition_wise_join() and generate_partition_wise_join_paths()
from 0009 and places them in the 0014 patch. Those are required for
multi-level partitioned tables, which are not supported in 0009.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v4.zipapplication/zip; name=pg_dp_join_patches_v4.zip
#50Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#48)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 15, 2017 at 8:49 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Of course, that supposes that 0009 can manage to postpone creating
non-sampled child joinrels until create_partition_join_plan(), which
it currently doesn't.

Right. We need the child-join's RelOptInfos to estimate sizes, so that
we could sample the largest ones. So postponing it looks difficult.

You have a point.

In fact, unless I'm missing something, 0009
hasn't been even slightly adapted to take advantage of the
infrastructure in 0001; it doesn't seem to reset the path_cxt or
anything. That seems like a fairly major omission.

The path_cxt reset introduced by 0001 recycles memory used by all the
paths, including paths created for the children. But that happens only
after all the planning has completed. I thought that's what we
discussed to be done. We could create a separate path context for
every top-level child-join.

I don't think we need to create a new context for each top-level
child-join, but I think we should create a context to be used across
all top-level child-joins and then reset it after planning each one.
I thought the whole point here was that NOT doing that caused the
memory usage for partitionwise join to get out of control. Am I
confused?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#49)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 15, 2017 at 8:55 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Sorry. That was added by my patch to refactor
set_append_rel_pathlist(). I have added a patch in the series to
remove that line.

It's not worth an extra commit just to change what isn't broken.
Let's just leave it alone.

Very sad. I guess if we had parallel append available, we could maybe
dodge this problem, but for now I suppose we're stuck with it.

Really sad. Is there a way to look at the relation (without any
partial paths yet) and see whether the relation will have partial
paths or not. Even if we don't have actual partial paths but know that
there will be at least one added in the future, we will be able to fix
this problem.

I don't think so. If we know that rel->consider_parallel will end up
true for a plain table, we should always get a parallel sequential
scan path at least, but if there are foreign tables involved, then
nothing is guaranteed.

partition_wise_plan_weight may be useful for testing, but I don't
think it should be present in the final patch.

partition_join test needs it so that it can work with smaller dataset
and complete faster. For smaller data sets the partition-wise join
paths come out to be costlier than other kinds and are never chosen.
By setting partition_wise_plan_weight I can force partition-wise join
to be chosen. An alternate solution would be to use
sample_partition_fraction = 1.0, but then we will never test delayed
planning for unsampled child-joins. I also think that users will find
partition_wise_plan_weight useful when estimates based on samples are
unrealistic. Obviously, in a longer run we should be able to provide
better estimates.

I still don't like it -- we have no other similar knob.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#51)
Re: Partition-wise join for join between (declaratively) partitioned tables

So I am looking at this part of 0008:

+       /*
+        * Do not copy parent_rinfo and child_rinfos because 1. they create a
+        * circular dependency between child and parent RestrictInfo 2. dropping
+        * those links just means that we loose some memory
optimizations. 3. There
+        * is a possibility that the child and parent RestrictInfots
themselves may
+        * have got copied and thus the old links may no longer be valid. The
+        * caller may set up those links itself, if needed.
+        */

I don't think that it's very clear whether or not this is safe. I
experimented with making _copyRestrictInfo PANIC, which,
interestingly, does not affect the core regression tests at all, but
does trip on this bit from the postgres_fdw tests:

-- subquery using stable function (can't be sent to remote)
PREPARE st2(int) AS SELECT * FROM ft1 t1 WHERE t1.c1 < $2 AND t1.c3 IN
(SELECT c3 FROM ft2 t2 WHERE c1 > $1 AND date(c4) =
'1970-01-17'::date) ORDER BY c1;
EXPLAIN (VERBOSE, COSTS OFF) EXECUTE st2(10, 20);

I'm not sure why this particular case is affected when so many others
are not, and the comment doesn't help me very much in figuring it out.

Why do we need this cache in the RestrictInfo, anyway? Aside from the
comment above, I looked at the comment in the RestrictInfo struct, and
I looked at the comment in build_child_restrictinfo, and I looked at
the comment in build_child_clauses, and I looked at the place where
build_child_clauses is called in set_append_rel_size, and none of
those places explain why we need this cache. I would assume we'd need
a separate translation of the RestrictInfo for every separate
child-join, so how does the cache help?

Maybe the answer is that build_child_clauses() is also called from
try_partition_wise_join() and add_paths_to_child_joinrel(), and those
three call sights all end up producing the same set of translated
RestrictInfos. But if that's the case, somehow it seems like we ought
to be producing these in one place where we can get convenient access
to them from each child join, rather than having to search through
this cache to find it. It's a pretty inefficient cache: it takes O(n)
time to search it, I think, where n is the number of partitions. And
you do O(n) searches. So it's an O(n^2) algorithm, which is a little
unfortunate. Can't we affix the translated RestrictInfos someplace
where they can be found more efficiently?

Yet another thing that the comments don't explain is why the existing
adjust_appendrel_attrs call needs to be replaced with
build_child_clauses.

So I feel, overall, that the point of all of this is not explained well at all.

...Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#50)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 16, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 15, 2017 at 8:49 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Of course, that supposes that 0009 can manage to postpone creating
non-sampled child joinrels until create_partition_join_plan(), which
it currently doesn't.

Right. We need the child-join's RelOptInfos to estimate sizes, so that
we could sample the largest ones. So postponing it looks difficult.

You have a point.

In fact, unless I'm missing something, 0009
hasn't been even slightly adapted to take advantage of the
infrastructure in 0001; it doesn't seem to reset the path_cxt or
anything. That seems like a fairly major omission.

The path_cxt reset introduced by 0001 recycles memory used by all the
paths, including paths created for the children. But that happens only
after all the planning has completed. I thought that's what we
discussed to be done. We could create a separate path context for
every top-level child-join.

I don't think we need to create a new context for each top-level
child-join, but I think we should create a context to be used across
all top-level child-joins and then reset it after planning each one.

Sorry, that's what I meant by creating a new context for each
top-level child-join. So, we need to copy the required path tree
before resetting the context. I am fine doing that but read on.

I thought the whole point here was that NOT doing that caused the
memory usage for partitionwise join to get out of control. Am I
confused?

We took a few steps to reduce the memory footprint of partition-wise
join in [1]/messages/by-id/CAFjFpRcZ_M3-JxoiDkdoPS+-9Cok4ux9Si+4drcRL-62af=jWw@mail.gmail.com and [2]/messages/by-id/CAFjFpRe66z+w9+dnAkWGiaB1CU2CUQsLGsqzHzYBoA=KJFf+PQ@mail.gmail.com. According to the numbers reported in [1]/messages/by-id/CAFjFpRcZ_M3-JxoiDkdoPS+-9Cok4ux9Si+4drcRL-62af=jWw@mail.gmail.com and then
in [2]/messages/by-id/CAFjFpRe66z+w9+dnAkWGiaB1CU2CUQsLGsqzHzYBoA=KJFf+PQ@mail.gmail.com, if the total memory consumed by a planner is 44MB (memory
consumed by paths 150K) for a 5-way non-parition-wise join between
tables with 1000 partitions, partition-wise join consumed 192MB which
is 4.4 times the non-partitino-wise case. The earlier implementation
of blowing away a memory context after each top-level child-join, just
got rid of the paths created for that child-join. The total memory
consumed by paths created for all the child-joins was about 150MB.
Remember that we can not get rid of memory consumed by expressions,
RelOptInfos, RestrictInfos etc. since their pointers will be copied
into the plan nodes.

With changes in 0001, what happens is we accumulate 150MB till the end
of the planning and get rid of it after we have created a plan. So,
till the plan is created we are consuming approx. 192MB + 150MB =
342MB memory and are getting rid of 150MB memory after we have created
the plan. I am not sure whether consuming extra 150MB or for that
matter 342MB in a setup with a thousand partitions is "going out of
control". (342MB is approx. 7.8 time 44MB; not 1000 times, and not
even 10 times). But if you think that we should throw away unused
paths after planning every top-level child-join I am fine with it.

[1]: /messages/by-id/CAFjFpRcZ_M3-JxoiDkdoPS+-9Cok4ux9Si+4drcRL-62af=jWw@mail.gmail.com
[2]: /messages/by-id/CAFjFpRe66z+w9+dnAkWGiaB1CU2CUQsLGsqzHzYBoA=KJFf+PQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#54Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#51)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 16, 2017 at 12:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 15, 2017 at 8:55 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Sorry. That was added by my patch to refactor
set_append_rel_pathlist(). I have added a patch in the series to
remove that line.

It's not worth an extra commit just to change what isn't broken.
Let's just leave it alone.

Ok. Removed that patch from the set of patches.

Very sad. I guess if we had parallel append available, we could maybe
dodge this problem, but for now I suppose we're stuck with it.

Really sad. Is there a way to look at the relation (without any
partial paths yet) and see whether the relation will have partial
paths or not. Even if we don't have actual partial paths but know that
there will be at least one added in the future, we will be able to fix
this problem.

I don't think so. If we know that rel->consider_parallel will end up
true for a plain table, we should always get a parallel sequential
scan path at least, but if there are foreign tables involved, then
nothing is guaranteed.

Ok.

partition_wise_plan_weight may be useful for testing, but I don't
think it should be present in the final patch.

partition_join test needs it so that it can work with smaller dataset
and complete faster. For smaller data sets the partition-wise join
paths come out to be costlier than other kinds and are never chosen.
By setting partition_wise_plan_weight I can force partition-wise join
to be chosen. An alternate solution would be to use
sample_partition_fraction = 1.0, but then we will never test delayed
planning for unsampled child-joins. I also think that users will find
partition_wise_plan_weight useful when estimates based on samples are
unrealistic. Obviously, in a longer run we should be able to provide
better estimates.

I still don't like it -- we have no other similar knob.

We have another cost-skewing GUC, disable_cost, which adds a huge cost
to anything that needs to be disabled. This is different in the sense
that it multiplies the cost.

Well, in that case, we can replace it with force_partition_wise_plan
(on/off) for the sake of regression, to test with smaller data. Even
then we will need to adjust the costs, so that partition-wise join
plan comes out to be the cheapest. Probably we will need set
partition-wise join plan costs to very low or even 0 when
force_partition_wise_plan is set to on. Does that look good? Any other
ideas?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#52)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 16, 2017 at 7:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

So I am looking at this part of 0008:

+       /*
+        * Do not copy parent_rinfo and child_rinfos because 1. they create a
+        * circular dependency between child and parent RestrictInfo 2. dropping
+        * those links just means that we loose some memory
optimizations. 3. There
+        * is a possibility that the child and parent RestrictInfots
themselves may
+        * have got copied and thus the old links may no longer be valid. The
+        * caller may set up those links itself, if needed.
+        */

I don't think that it's very clear whether or not this is safe. I
experimented with making _copyRestrictInfo PANIC,

I am not able to understand how to make _copyRestrictInfo PANIC. Can
you please share the patch or compiler flags or settings? I will look
at the case below once I have that.

which,
interestingly, does not affect the core regression tests at all, but
does trip on this bit from the postgres_fdw tests:

-- subquery using stable function (can't be sent to remote)
PREPARE st2(int) AS SELECT * FROM ft1 t1 WHERE t1.c1 < $2 AND t1.c3 IN
(SELECT c3 FROM ft2 t2 WHERE c1 > $1 AND date(c4) =
'1970-01-17'::date) ORDER BY c1;
EXPLAIN (VERBOSE, COSTS OFF) EXECUTE st2(10, 20);

I'm not sure why this particular case is affected when so many others
are not, and the comment doesn't help me very much in figuring it out.

Why do we need this cache in the RestrictInfo, anyway? Aside from the
comment above, I looked at the comment in the RestrictInfo struct, and
I looked at the comment in build_child_restrictinfo, and I looked at
the comment in build_child_clauses, and I looked at the place where
build_child_clauses is called in set_append_rel_size, and none of
those places explain why we need this cache. I would assume we'd need
a separate translation of the RestrictInfo for every separate
child-join, so how does the cache help?

Maybe the answer is that build_child_clauses() is also called from
try_partition_wise_join() and add_paths_to_child_joinrel(), and those
three call sights all end up producing the same set of translated
RestrictInfos. But if that's the case, somehow it seems like we ought
to be producing these in one place where we can get convenient access
to them from each child join, rather than having to search through
this cache to find it.

I had explained this briefly in [1]/messages/by-id/CAFjFpRe66z+w9+dnAkWGiaB1CU2CUQsLGsqzHzYBoA=KJFf+PQ@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company. But forgot to add it as comments.

There are multiple means by which a RestrictInfo gets translated
multiple times for the same child.

1. Consider a join A J (B J C on B.b = C.c) ON (A.a = B.b) the clause
A.a = B.b is part of the restrictlist for orders (AB)C and A(BC) (and
(AC)B depending upon the type of join). So, the clause gets translated
twice once for each of those join orders.

2. In the above example, A.a = B.b is part of joininfo list (if it
happens to be an outer join) of A, B and BC. So, it should be part of
joininfo list of children of A, B and BC. But the RestrictInfo which
is part of joininfo of B and BC looks exactly same.

Similarly param_info->clauses get translated multiple times each time
with a different set of required_outer.

In order to avoid multiple translations and spend memory in each
translation it's better to cache the result and retrieve it.

Updated prologue of build_child_restrictinfo with this explanation.

It's a pretty inefficient cache: it takes O(n)
time to search it, I think, where n is the number of partitions.

Above explanation shows that it's worse than that.

And
you do O(n) searches. So it's an O(n^2) algorithm, which is a little
unfortunate. Can't we affix the translated RestrictInfos someplace
where they can be found more efficiently?

Would a hash similar to root->join_rel_hash help? That will reduce the
searches to O(1). I have added a separate patch (0008) for using
hashtable to store child restrictinfos. If that patch looks good to
you, I will merge it with the main patch supporting partition-wise
join.

Yet another thing that the comments don't explain is why the existing
adjust_appendrel_attrs call needs to be replaced with
build_child_clauses.

The call to adjust_appendrel_attrs() used to translate joininfo for
child has been replaced by build_child_clauses to take advantage of
the RestrictInfo cache. As explained above a clause which is part of
joininfo of a child, is also part of joininfo of the child-join in
which it participates except the child-joins covering the clause. So,
a cached copy of that RestrictInfo helps. I have added a patch (0010)
to use build_child_clause() only for partitioned tables and use
adjust_appendrel_attrs() for non-partitioned case. If this change
looks good, I will merge it with the main patch.

So I feel, overall, that the point of all of this is not explained well at all.

Sorry for that. I should have added the explanation in the comments.
Corrected this in this set of patches.

[1]: /messages/by-id/CAFjFpRe66z+w9+dnAkWGiaB1CU2CUQsLGsqzHzYBoA=KJFf+PQ@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v5.zipapplication/zip; name=pg_dp_join_patches_v5.zip
#56Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#53)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 16, 2017 at 6:48 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I thought the whole point here was that NOT doing that caused the
memory usage for partitionwise join to get out of control. Am I
confused?

We took a few steps to reduce the memory footprint of partition-wise
join in [1] and [2]. According to the numbers reported in [1] and then
in [2], if the total memory consumed by a planner is 44MB (memory
consumed by paths 150K) for a 5-way non-parition-wise join between
tables with 1000 partitions, partition-wise join consumed 192MB which
is 4.4 times the non-partitino-wise case. The earlier implementation
of blowing away a memory context after each top-level child-join, just
got rid of the paths created for that child-join. The total memory
consumed by paths created for all the child-joins was about 150MB.
Remember that we can not get rid of memory consumed by expressions,
RelOptInfos, RestrictInfos etc. since their pointers will be copied
into the plan nodes.

All right, I propose that we revise our plan for attacking this
problem. The code in this patch that proposes to reduce memory
utilization is very complicated and it's likely to cause us to miss
this release altogether if we keep hacking on it. So, I propose that
you refactor this patch series so that the first big patch is
partition-wise join without any of the optimizations that save memory
- essentially the sample_partition_fraction = 1 case with all
memory-saving optimizations removed. If it's only there to save
memory, rip it out. Also, change the default value of
enable_partition_wise_join to false and document that turning it on
may cause a large increase in planner memory utilization, and that's
why it's not enabled by default.

If we get that committed, then we can have follow-on patches that add
the incremental path creation stuff and other memory-saving features,
and then at the end we can flip the default from "off" to "on".
Probably that last part will slip beyond v10 since we're only two
weeks from the end of the release cycle, but I think that's still
better than having everything slip. Let's also put the multi-level
partition-wise join stuff ahead of the memory-saving stuff, because
being able to do only a single-level of partition-wise join is a
fairly unimpressive feature; I'm not sure this is really even
committable without that.

I realize in some sense that I'm telling you to go and undo all of the
work that you just did based on what I told you before, but I think
we've actually made some pretty good progress here: it's now clear
that there are viable strategies for getting the memory usage down to
an acceptable level, and we've got draft patches for those strategies.
So committing the core feature without immediately including that work
can't be regarded as breaking everything hopelessly; rather, it now
looks (I think, anyway) like a reasonable intermediate step towards
the eventual goal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#55)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 16, 2017 at 7:19 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Mar 16, 2017 at 7:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

So I am looking at this part of 0008:

+       /*
+        * Do not copy parent_rinfo and child_rinfos because 1. they create a
+        * circular dependency between child and parent RestrictInfo 2. dropping
+        * those links just means that we loose some memory
optimizations. 3. There
+        * is a possibility that the child and parent RestrictInfots
themselves may
+        * have got copied and thus the old links may no longer be valid. The
+        * caller may set up those links itself, if needed.
+        */

I don't think that it's very clear whether or not this is safe. I
experimented with making _copyRestrictInfo PANIC,

I am not able to understand how to make _copyRestrictInfo PANIC. Can
you please share the patch or compiler flags or settings? I will look
at the case below once I have that.

I just put elog(PANIC, "_copyRestrictInfo") into the function.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#56)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 16, 2017 at 8:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 16, 2017 at 6:48 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I thought the whole point here was that NOT doing that caused the
memory usage for partitionwise join to get out of control. Am I
confused?

We took a few steps to reduce the memory footprint of partition-wise
join in [1] and [2]. According to the numbers reported in [1] and then
in [2], if the total memory consumed by a planner is 44MB (memory
consumed by paths 150K) for a 5-way non-parition-wise join between
tables with 1000 partitions, partition-wise join consumed 192MB which
is 4.4 times the non-partitino-wise case. The earlier implementation
of blowing away a memory context after each top-level child-join, just
got rid of the paths created for that child-join. The total memory
consumed by paths created for all the child-joins was about 150MB.
Remember that we can not get rid of memory consumed by expressions,
RelOptInfos, RestrictInfos etc. since their pointers will be copied
into the plan nodes.

All right, I propose that we revise our plan for attacking this
problem. The code in this patch that proposes to reduce memory
utilization is very complicated and it's likely to cause us to miss
this release altogether if we keep hacking on it. So, I propose that
you refactor this patch series so that the first big patch is
partition-wise join without any of the optimizations that save memory
- essentially the sample_partition_fraction = 1 case with all
memory-saving optimizations removed. If it's only there to save
memory, rip it out. Also, change the default value of
enable_partition_wise_join to false and document that turning it on
may cause a large increase in planner memory utilization, and that's
why it's not enabled by default.

If we get that committed, then we can have follow-on patches that add
the incremental path creation stuff and other memory-saving features,
and then at the end we can flip the default from "off" to "on".
Probably that last part will slip beyond v10 since we're only two
weeks from the end of the release cycle, but I think that's still
better than having everything slip. Let's also put the multi-level
partition-wise join stuff ahead of the memory-saving stuff, because
being able to do only a single-level of partition-wise join is a
fairly unimpressive feature; I'm not sure this is really even
committable without that.

I realize in some sense that I'm telling you to go and undo all of the
work that you just did based on what I told you before, but I think
we've actually made some pretty good progress here: it's now clear
that there are viable strategies for getting the memory usage down to
an acceptable level, and we've got draft patches for those strategies.
So committing the core feature without immediately including that work
can't be regarded as breaking everything hopelessly; rather, it now
looks (I think, anyway) like a reasonable intermediate step towards
the eventual goal.

Here's the set of patches with all the memory saving stuff removed.
It's now bare partition-wise joins. I have tried to eliminate all
memory saving stuff carefully, except few bms_free() and list_free()
which fit the functions they were part of and mostly were present in
my earlier versions of patches. But I might have missed some. Also, I
have corrected any indentation/white space mistakes introduced by
editing patches with +/-, but I might have missed some. Please let me
know.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v6.zipapplication/zip; name=pg_dp_join_patches_v6.zip
#59Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#58)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Here's the set of patches with all the memory saving stuff removed.
It's now bare partition-wise joins. I have tried to eliminate all
memory saving stuff carefully, except few bms_free() and list_free()
which fit the functions they were part of and mostly were present in
my earlier versions of patches. But I might have missed some. Also, I
have corrected any indentation/white space mistakes introduced by
editing patches with +/-, but I might have missed some. Please let me
know.

Rajkumar offlist reported two issues with earlier set of patches.
1. 0008 conflicted with latest changes in postgres_fdw/deparse.c.

2. In the earlier set of patches part_scheme of a join relation was
being set when joining relations had same part_scheme even if there
was no equi-join between partition keys. The idea being that
rel->part_scheme and rel->part_rels together tell whether a relation
is partitioned or not. At a later stage if none of the joining pairs
resulted in partitioned join, part_rels would be NULL and then we
would reset part_scheme as well. But this logic not required. For the
exact partition scheme matching, that we are using, if one pair of
joining relation has an equi-join on partition keys, and both of those
have exactly same partitioning scheme, all other pairs of joining
relations would have an equi-join on partition keys and also exactly
same partitioning scheme. So, we can set part_scheme only by looking
at the first pair of joining relation while building the child-join.

This set of patches fixes both of those things.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v7.zipapplication/zip; name=pg_dp_join_patches_v7.zip
#60Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#59)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

This set of patches fixes both of those things.

0001 changes the purpose of a function and then 0007 renames it. It
would be better to include the renaming in 0001 so that you're not
taking multiple whacks at the same function in the same patch series.
I believe it would also be best to include 0011's changes to
adjust_appendrel_attrs_multilevel in 0001.

0002 should either add find_param_path_info() to the relevant header
file as extern from the beginning, or it should declare and define it
as static and then 0007 can remove those markings. It makes no sense
to declare it as extern but put the prototype in the .c file.

0004 still needs to be pared down. If you want to get something
committed this release cycle, you have to get these details taken care
of, uh, more or less immediately. Actually, preferably, several weeks
ago. You're welcome to maintain your own test suite locally but what
you submit should be what you are proposing for commit -- or if not,
then you should separate the part proposed for commit and the part
included for dev testing into two different patches.

In 0005's README, the part about planning partition-wise joins in two
phases needs to be removed. This patch also contains a small change
to partition_join.sql that belongs in 0004.

0008 removes direct tests against RELOPT_JOINREL almost everywhere,
but it overlooks the new ones added to postgres_fdw.c by
b30fb56b07a885f3476fe05920249f4832ca8da5. It should be updated to
cover those as well, I suspect. The commit message claims that it
will "Similarly replace RELOPT_OTHER_MEMBER_REL test with
IS_OTHER_REL() where we want to test for child relations of all kinds,
but in fact it makes exactly zero such substitutions.

While I was studying what you did with reparameterize_path_by_child(),
I started to wonder whether reparameterize_path() doesn't need to
start handling join paths. I think it only handles scan paths right
now because that's the only thing that can appear under an appendrel
created by inheritance expansion, but you're changing that. Maybe
it's not critical -- I think the worst consequences of missing some
handling there is that we won't consider a parameterized path in some
case where it would be advantageous to do so. Still, you might want
to investigate a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#61Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Robert Haas (#60)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Mar 18, 2017 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

This set of patches fixes both of those things.

0001 changes the purpose of a function and then 0007 renames it. It
would be better to include the renaming in 0001 so that you're not
taking multiple whacks at the same function in the same patch series.
I believe it would also be best to include 0011's changes to
adjust_appendrel_attrs_multilevel in 0001.

0002 should either add find_param_path_info() to the relevant header
file as extern from the beginning, or it should declare and define it
as static and then 0007 can remove those markings. It makes no sense
to declare it as extern but put the prototype in the .c file.

0004 still needs to be pared down. If you want to get something
committed this release cycle, you have to get these details taken care
of, uh, more or less immediately. Actually, preferably, several weeks
ago. You're welcome to maintain your own test suite locally but what
you submit should be what you are proposing for commit -- or if not,
then you should separate the part proposed for commit and the part
included for dev testing into two different patches.

In 0005's README, the part about planning partition-wise joins in two
phases needs to be removed. This patch also contains a small change
to partition_join.sql that belongs in 0004.

0008 removes direct tests against RELOPT_JOINREL almost everywhere,
but it overlooks the new ones added to postgres_fdw.c by
b30fb56b07a885f3476fe05920249f4832ca8da5. It should be updated to
cover those as well, I suspect. The commit message claims that it
will "Similarly replace RELOPT_OTHER_MEMBER_REL test with
IS_OTHER_REL() where we want to test for child relations of all kinds,
but in fact it makes exactly zero such substitutions.

While I was studying what you did with reparameterize_path_by_child(),
I started to wonder whether reparameterize_path() doesn't need to
start handling join paths. I think it only handles scan paths right
now because that's the only thing that can appear under an appendrel
created by inheritance expansion, but you're changing that. Maybe
it's not critical -- I think the worst consequences of missing some
handling there is that we won't consider a parameterized path in some
case where it would be advantageous to do so. Still, you might want
to investigate a bit.

I was trying to play around with this patch and came across following
case when without the patch query completes in 9 secs and with it in
15 secs. Theoretically, I tried to capture the case when each
partition is having good amount of rows in output and each has to
build their own hash, in that case the cost of building so many hashes
comes to be more costly than having an append and then join. Thought
it might be helpful to consider this case in better designing of the
algorithm. Please feel free to point out if I missed something.

Test details:
commit: b4ff8609dbad541d287b332846442b076a25a6df
Please find the attached .sql file for the complete schema and data
and .out file for the result of explain analyse with and without
patch.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

pwj_regress_test.outapplication/octet-stream; name=pwj_regress_test.out
test_case_pwj.sqlapplication/octet-stream; name=test_case_pwj.sql
#62Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Rafia Sabih (#61)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sun, Mar 19, 2017 at 12:15 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

I was trying to play around with this patch and came across following
case when without the patch query completes in 9 secs and with it in
15 secs. Theoretically, I tried to capture the case when each
partition is having good amount of rows in output and each has to
build their own hash, in that case the cost of building so many hashes
comes to be more costly than having an append and then join. Thought
it might be helpful to consider this case in better designing of the
algorithm. Please feel free to point out if I missed something.

In the non-partitionwise plan, the query planner correctly chooses to
hash the same table (prt2) and probe from the large table (prt). In
the partition-wise plan, it generally does the opposite. There is a
mix of merge joins and hash joins, but of the 15 children that picked
merge joins, 14 of them hashed the larger partition (in each case,
from prt) and probed from the smaller one (in each case, from prt2),
which seems like an odd strategy. So I think the problem is not that
building lots of hash tables is slower than building just one, but
rather that for some reason it's choosing the wrong table to hash.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#60)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:

While I was studying what you did with reparameterize_path_by_child(),
I started to wonder whether reparameterize_path() doesn't need to
start handling join paths. I think it only handles scan paths right
now because that's the only thing that can appear under an appendrel
created by inheritance expansion, but you're changing that. Maybe
it's not critical -- I think the worst consequences of missing some
handling there is that we won't consider a parameterized path in some
case where it would be advantageous to do so. Still, you might want
to investigate a bit.

I spent a fair amount of time this weekend musing over
reparameterize_path_by_child(). I think a key question for this patch
- as you already pointed out - is whether we're happy with that
approach. When we discover that we want to perform a partitionwise
parameterized nestloop, and therefore that we need the paths for each
inner appendrel to get their input values from the corresponding outer
appendrel members rather than from the outer parent, we've got two
choices. The first is to do what the patch actually does, which is to
build a new path tree for the nestloop inner path parameterized by the
appropriate childrel. The second is to use the existing paths, which
are parameterized by the parent rel, and then somehow allow make that
work. For example, you can imagine that create_plan_recurse() could
pass down a list of parameterized nestloops above the current point in
the path tree, and a parent-child mapping for each, and then we could
try to substitute everything while actually generating the plan
instead of creating paths sooner. Which is better?

It would be nice to hear opinions from anyone else who cares, but
after some thought I think the approach you've picked is probably
better, because it's more like what we do already. We have existing
precedent for reparameterizing a path, but none for allowing a Var for
one relation (the parent) to in effect refer to another relation (the
child).

That having been said, having try_nestloop_path() perform the
reparameterization at the very top of the function seems quite
undesirable. You're creating a new path there before you know whether
it's going to be rejected by the invalid-parameterization test and
also before you know whether initial_cost_nestloop is going to reject
it. It would be much better if you could find a way to postpone the
reparameterization until after those steps, and only do it if you're
going to try add_path().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#60)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Mar 18, 2017 at 5:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 17, 2017 at 9:15 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

This set of patches fixes both of those things.

0001 changes the purpose of a function and then 0007 renames it. It
would be better to include the renaming in 0001 so that you're not
taking multiple whacks at the same function in the same patch series.

adjust_relid_set was renamed as adjust_child_relids() post
"extern"alising. I think, this comment is about that function. Done.

I believe it would also be best to include 0011's changes to
adjust_appendrel_attrs_multilevel in 0001.

The function needs to repeat the "adjustment" process for every
"other" relation (join or base) that it encounters, by testing using
OTHER_BASE_REL or OTHER_JOINREL in short IS_OTHER_REL(). The last
macros are added by the partition-wise join implementation patch 0005.
It doesn't make sense to add that macro in 0001 OR modify that
function twice, once in 0001 and then after 0005. So, I will leave it
to be part of 0011, where the changes are actually needed.

0002 should either add find_param_path_info() to the relevant header
file as extern from the beginning, or it should declare and define it
as static and then 0007 can remove those markings. It makes no sense
to declare it as extern but put the prototype in the .c file.

Done, added find_param_path_info() as an extern definition to start
with. I have also squashed 0001 and 0002 together, since they are both
refactoring patches and from your next mail about
reparameterize_path_by_child(), it seems that we are going to accept
the approach in that patch.

0004 still needs to be pared down. If you want to get something
committed this release cycle, you have to get these details taken care
of, uh, more or less immediately. Actually, preferably, several weeks
ago. You're welcome to maintain your own test suite locally but what
you submit should be what you are proposing for commit -- or if not,
then you should separate the part proposed for commit and the part
included for dev testing into two different patches.

Done. Now SQL file has 325 lines and output has 1697 lines as against
515 and 4085 lines resp. earlier.

In 0005's README, the part about planning partition-wise joins in two
phases needs to be removed.

Done.

This patch also contains a small change
to partition_join.sql that belongs in 0004.

The reason I added the test patch prior to implementation was 1. for
me to make sure the tests that the queries run without the
optimization and the results they produce to catch any issues with
partitioning implementation. That would help someone looking at those
patches as well. 2. Once partitioning implementation patch was
applied, once could see the purpose of changes in two follow on
patches. Now that that purpose has served, I have reordered the
patches so that test patch comes after the implementation and follow
on fixes. If you still want to run the test before or after any of
those patches, you could apply the patch separately.

0008 removes direct tests against RELOPT_JOINREL almost everywhere,
but it overlooks the new ones added to postgres_fdw.c by
b30fb56b07a885f3476fe05920249f4832ca8da5. It should be updated to
cover those as well, I suspect.

Done.

deparseSubqueryTargetList() and some other functions are excluding
"other" base relation from the assertions. I guess, that's a problem.
Will submit a separate patch to fix this.

The commit message claims that it
will "Similarly replace RELOPT_OTHER_MEMBER_REL test with
IS_OTHER_REL() where we want to test for child relations of all kinds,
but in fact it makes exactly zero such substitutions.

The relevant changes have been covered by other commits. Removed this
line from the commit message.

While I was studying what you did with reparameterize_path_by_child(),
I started to wonder whether reparameterize_path() doesn't need to
start handling join paths. I think it only handles scan paths right
now because that's the only thing that can appear under an appendrel
created by inheritance expansion, but you're changing that. Maybe
it's not critical -- I think the worst consequences of missing some
handling there is that we won't consider a parameterized path in some
case where it would be advantageous to do so. Still, you might want
to investigate a bit.

Yes, we need to update reparameterize_path() for child-joins. A path
for child base relation gets reparameterized, if there exists a path
with that parameterization in at least one other child. The
parameterization bubbles up the join tree from base relations. So, if
a child required to be reparameterized, probably all its joins require
reparameterization, since that parameterization would bubble up the
child-join tree in which some other child participates. But as you
said it's an optimization and not a correctness issue. The function
get_cheapest_parameterized_child_path() returns NULL, if it can not
find or create a path (by reparameterization) with required
parameterization. Its caller add_paths_to_append_rel() is capable of
handling NULL values by not creating append paths with that
paramterization. If the "append" relation requires minimum
parameterization, all its children will create that minimum
parameterization, hence do not require to reparameterize path. So,
there isn't any correctness issue there.

There are two ways to fix it,

1. when we create a reparameterized path add it to the list of paths,
thus the parameterization bubbles up the join tree. But then we will
be changing the path list after set_cheapest() has been called OR may
be throwing out paths which other paths refer to. That's not
desirable. May be we can save this path in another list and create
join paths using this path instead of reparameterizing existing join
paths.
2. Add code to reparameterize_path() to handle join paths, and I think
all kinds of paths since we might have trickle the parameterization
down the joining paths which could be almost anything including
sort_paths, unique_paths etc. That looks like a significant effort. I
think, we should attack it separately after the stock partition-wise
join has been committed.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#65Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#63)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 8:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:

While I was studying what you did with reparameterize_path_by_child(),
I started to wonder whether reparameterize_path() doesn't need to
start handling join paths. I think it only handles scan paths right
now because that's the only thing that can appear under an appendrel
created by inheritance expansion, but you're changing that. Maybe
it's not critical -- I think the worst consequences of missing some
handling there is that we won't consider a parameterized path in some
case where it would be advantageous to do so. Still, you might want
to investigate a bit.

I spent a fair amount of time this weekend musing over
reparameterize_path_by_child(). I think a key question for this patch
- as you already pointed out - is whether we're happy with that
approach. When we discover that we want to perform a partitionwise
parameterized nestloop, and therefore that we need the paths for each
inner appendrel to get their input values from the corresponding outer
appendrel members rather than from the outer parent, we've got two
choices. The first is to do what the patch actually does, which is to
build a new path tree for the nestloop inner path parameterized by the
appropriate childrel. The second is to use the existing paths, which
are parameterized by the parent rel, and then somehow allow make that
work. For example, you can imagine that create_plan_recurse() could
pass down a list of parameterized nestloops above the current point in
the path tree, and a parent-child mapping for each, and then we could
try to substitute everything while actually generating the plan
instead of creating paths sooner. Which is better?

It would be nice to hear opinions from anyone else who cares, but
after some thought I think the approach you've picked is probably
better, because it's more like what we do already. We have existing
precedent for reparameterizing a path, but none for allowing a Var for
one relation (the parent) to in effect refer to another relation (the
child).

Right. If we could use parent Vars to indicate parent Var or child Var
depending upon the context, a lot of memory issues would be solved; we
wouldn't need to translate a single expression. But I think that's not
straight forward. I have been thinking about some kind of polymorphic
Var node, but it seems a lot more invasive change. Although, if we
could get something like that, we would save a huge memory. :)

That having been said, having try_nestloop_path() perform the
reparameterization at the very top of the function seems quite
undesirable. You're creating a new path there before you know whether
it's going to be rejected by the invalid-parameterization test and
also before you know whether initial_cost_nestloop is going to reject
it. It would be much better if you could find a way to postpone the
reparameterization until after those steps, and only do it if you're
going to try add_path().

Hmm. I think we can do that by refactoring
calc_nestloop_required_outer(), allow_star_schema_join() and
have_dangerous_phv() to use relids instead of paths. If the checks
pass for a join between parents, those should pass for joins between
children. Done in the attached set of patches.

try_nestloop_path has few new variables. Among those innerrelids and
outerrelids indicate the relids to be used by the parameterization
checks (see patch for details). They are not relids of inner and outer
relations resp. but kind of effective relids to be used. But I
couldn't come up with better names which convey proper meaning and
still are short enough. effective_innerrelids is mouthful.

I am wondering whether we need to change
calc_non_nestloop_required_outer() similar to
calc_nestloop_required_outer() just to keep their signatures in sync.

Should I work on completing reparamterized_path_by_child() to support
all kinds of paths?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v8.zipapplication/zip; name=pg_dp_join_patches_v8.zip
#66Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Robert Haas (#63)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 8:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 17, 2017 at 8:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:

While I was studying what you did with reparameterize_path_by_child(),
I started to wonder whether reparameterize_path() doesn't need to
start handling join paths. I think it only handles scan paths right
now because that's the only thing that can appear under an appendrel
created by inheritance expansion, but you're changing that. Maybe
it's not critical -- I think the worst consequences of missing some
handling there is that we won't consider a parameterized path in some
case where it would be advantageous to do so. Still, you might want
to investigate a bit.

I spent a fair amount of time this weekend musing over
reparameterize_path_by_child(). I think a key question for this patch
- as you already pointed out - is whether we're happy with that
approach. When we discover that we want to perform a partitionwise
parameterized nestloop, and therefore that we need the paths for each
inner appendrel to get their input values from the corresponding outer
appendrel members rather than from the outer parent, we've got two
choices. The first is to do what the patch actually does, which is to
build a new path tree for the nestloop inner path parameterized by the
appropriate childrel. The second is to use the existing paths, which
are parameterized by the parent rel, and then somehow allow make that
work. For example, you can imagine that create_plan_recurse() could
pass down a list of parameterized nestloops above the current point in
the path tree, and a parent-child mapping for each, and then we could
try to substitute everything while actually generating the plan
instead of creating paths sooner. Which is better?

It would be nice to hear opinions from anyone else who cares, but
after some thought I think the approach you've picked is probably
better, because it's more like what we do already. We have existing
precedent for reparameterizing a path, but none for allowing a Var for
one relation (the parent) to in effect refer to another relation (the
child).

That having been said, having try_nestloop_path() perform the
reparameterization at the very top of the function seems quite
undesirable. You're creating a new path there before you know whether
it's going to be rejected by the invalid-parameterization test and
also before you know whether initial_cost_nestloop is going to reject
it. It would be much better if you could find a way to postpone the
reparameterization until after those steps, and only do it if you're
going to try add_path().

On a further testing of this patch I find another case when it is
showing regression, the time taken with patch is around 160 secs and
without it is 125 secs.
Another minor thing to note that is planning time is almost twice with
this patch, though I understand that this is for scenarios with really
big 'big data' so this may not be a serious issue in such cases, but
it'd be good if we can keep an eye on this that it doesn't exceed the
computational bounds for a really large number of tables..
Please find the attached .out file to check the output I witnessed and
let me know if anymore information is required
Schema and data was similar to the preciously shared schema with the
addition of more data for this case, parameter settings used were:
work_mem = 1GB
random_page_cost = seq_page_cost = 0.1

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

pwj_regress_2.outapplication/octet-stream; name=pwj_regress_2.out
#67Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#64)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 9:44 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I believe it would also be best to include 0011's changes to
adjust_appendrel_attrs_multilevel in 0001.

The function needs to repeat the "adjustment" process for every
"other" relation (join or base) that it encounters, by testing using
OTHER_BASE_REL or OTHER_JOINREL in short IS_OTHER_REL(). The last
macros are added by the partition-wise join implementation patch 0005.
It doesn't make sense to add that macro in 0001 OR modify that
function twice, once in 0001 and then after 0005. So, I will leave it
to be part of 0011, where the changes are actually needed.

Hmm. I would kind of like to move the IS_JOIN_REL() and
IS_OTHER_REL() stuff to the front of the series. In other words, I
propose that we add those macros first, each testing for only the one
kind of RelOptInfo that exists today, and change all the code to use
them. Then, when we add child joinrels, we can modify the macros at
the same time. The problem with doing it the way you have it is that
those changes will have to be squashed into the main partitionwise
join commit, because otherwise stuff will be broken. Doing it the
other way around lets us commit that bit separately.

Done. Now SQL file has 325 lines and output has 1697 lines as against
515 and 4085 lines resp. earlier.

Sounds reasonable.

Now that that purpose has served, I have reordered the
patches so that test patch comes after the implementation and follow
on fixes.

Sounds good.

There are two ways to fix it,

1. when we create a reparameterized path add it to the list of paths,
thus the parameterization bubbles up the join tree. But then we will
be changing the path list after set_cheapest() has been called OR may
be throwing out paths which other paths refer to. That's not
desirable. May be we can save this path in another list and create
join paths using this path instead of reparameterizing existing join
paths.
2. Add code to reparameterize_path() to handle join paths, and I think
all kinds of paths since we might have trickle the parameterization
down the joining paths which could be almost anything including
sort_paths, unique_paths etc. That looks like a significant effort. I
think, we should attack it separately after the stock partition-wise
join has been committed.

I don't understand #1. #2 sounds like what I was expecting. I agree
it can be postponed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#68Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#65)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 9:44 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Right. If we could use parent Vars to indicate parent Var or child Var
depending upon the context, a lot of memory issues would be solved; we
wouldn't need to translate a single expression. But I think that's not
straight forward. I have been thinking about some kind of polymorphic
Var node, but it seems a lot more invasive change. Although, if we
could get something like that, we would save a huge memory. :)

Yes, that's why I'm interested in exploring that approach once the
basic framework is in place here.

I am wondering whether we need to change
calc_non_nestloop_required_outer() similar to
calc_nestloop_required_outer() just to keep their signatures in sync.

I haven't looked at the patch, but I don't think you need to worry about that.

Should I work on completing reparamterized_path_by_child() to support
all kinds of paths?

Yes, or at the very least all scans, like reparameterize_path() already does.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#69Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Rafia Sabih (#66)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On a further testing of this patch I find another case when it is
showing regression, the time taken with patch is around 160 secs and
without it is 125 secs.

This is basically the same problem as before; the partitionwise case
is doing the hash joins with the sides flipped from the optimal
strategy. I bet that's a bug in the code rather than a problem with
the concept.

Another minor thing to note that is planning time is almost twice with
this patch, though I understand that this is for scenarios with really
big 'big data' so this may not be a serious issue in such cases, but
it'd be good if we can keep an eye on this that it doesn't exceed the
computational bounds for a really large number of tables..

Yes, this is definitely going to use significant additional planning
time and memory. There are several possible strategies for improving
that situation, but I think we need to get the basics in place first.
That's why the proposal is now to have this turned off by default.
People joining really big tables that happen to be equipartitioned are
likely to want to turn it on, though, even before those optimizations
are done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#70Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#66)
Re: Partition-wise join for join between (declaratively) partitioned tables

On a further testing of this patch I find another case when it is
showing regression, the time taken with patch is around 160 secs and
without it is 125 secs.
Another minor thing to note that is planning time is almost twice with
this patch, though I understand that this is for scenarios with really
big 'big data' so this may not be a serious issue in such cases, but
it'd be good if we can keep an eye on this that it doesn't exceed the
computational bounds for a really large number of tables.

Right, planning time would be proportional to the number of partitions
at least in the first version. We may improve upon it later.

Please find the attached .out file to check the output I witnessed and
let me know if anymore information is required
Schema and data was similar to the preciously shared schema with the
addition of more data for this case, parameter settings used were:
work_mem = 1GB
random_page_cost = seq_page_cost = 0.1

The patch does not introduce any new costing model. It costs the
partition-wise join as sum of costs of joins between partitions. The
method to create the paths for joins between partitions is same as
creating the paths for joins between regular tables and then the
method to collect paths across partition-wise joins is same as
collecting paths across child base relations. So, there is a large
chance that the costing for joins between partitions might have a
problem which is showing up here. There may be some special handling
for regular tables versus child tables that may be the root cause. But
I have not seen that kind of code till now.

Can you please provide the outputs of individual partition-joins? If
the plans for joins between partitions are same as the ones chosen for
partition-wise joins, we may need to fix the existing join cost
models.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#71Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#67)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hmm. I would kind of like to move the IS_JOIN_REL() and
IS_OTHER_REL() stuff to the front of the series. In other words, I
propose that we add those macros first, each testing for only the one
kind of RelOptInfo that exists today, and change all the code to use
them. Then, when we add child joinrels, we can modify the macros at
the same time. The problem with doing it the way you have it is that
those changes will have to be squashed into the main partitionwise
join commit, because otherwise stuff will be broken. Doing it the
other way around lets us commit that bit separately.

I can provide a patch with adjust_appendrel_attrs_multilevel() changed
to child-joins, which can be applied before multi-level
partitioin-wise support patch but after partition-wise implementation
patch. You may consider applying that patch separately before
multi-level partition-wise support, in case we see that multi-level
partition-wise join support can be committed. Does that sound good?
That way we save changing those macros twice.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#72Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#71)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 12:52 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Hmm. I would kind of like to move the IS_JOIN_REL() and
IS_OTHER_REL() stuff to the front of the series. In other words, I
propose that we add those macros first, each testing for only the one
kind of RelOptInfo that exists today, and change all the code to use
them. Then, when we add child joinrels, we can modify the macros at
the same time. The problem with doing it the way you have it is that
those changes will have to be squashed into the main partitionwise
join commit, because otherwise stuff will be broken. Doing it the
other way around lets us commit that bit separately.

I can provide a patch with adjust_appendrel_attrs_multilevel() changed
to child-joins, which can be applied before multi-level
partitioin-wise support patch but after partition-wise implementation
patch. You may consider applying that patch separately before
multi-level partition-wise support, in case we see that multi-level
partition-wise join support can be committed. Does that sound good?
That way we save changing those macros twice.

That seems different than what I suggested and I'm not sure what the
reason is for the difference?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#73Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#72)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 10:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 20, 2017 at 12:52 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Hmm. I would kind of like to move the IS_JOIN_REL() and
IS_OTHER_REL() stuff to the front of the series. In other words, I
propose that we add those macros first, each testing for only the one
kind of RelOptInfo that exists today, and change all the code to use
them. Then, when we add child joinrels, we can modify the macros at
the same time. The problem with doing it the way you have it is that
those changes will have to be squashed into the main partitionwise
join commit, because otherwise stuff will be broken. Doing it the
other way around lets us commit that bit separately.

I can provide a patch with adjust_appendrel_attrs_multilevel() changed
to child-joins, which can be applied before multi-level
partitioin-wise support patch but after partition-wise implementation
patch. You may consider applying that patch separately before
multi-level partition-wise support, in case we see that multi-level
partition-wise join support can be committed. Does that sound good?
That way we save changing those macros twice.

That seems different than what I suggested and I'm not sure what the
reason is for the difference?

The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
the code to use it will look quite odd by itself. We are not changing
all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
those. There is code which needs to check those kinds, instead of "all
join rels" or "all other rels" resp. So the patch will add those
macros, change only few places to use those macros, which are intended
to be changed while applying partition-wise join support for single
level partitioned table.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#74Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#73)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

That seems different than what I suggested and I'm not sure what the
reason is for the difference?

The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
the code to use it will look quite odd by itself. We are not changing
all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
those. There is code which needs to check those kinds, instead of "all
join rels" or "all other rels" resp. So the patch will add those
macros, change only few places to use those macros, which are intended
to be changed while applying partition-wise join support for single
level partitioned table.

Hmm. You might be right, but I'm not convinced.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Robert Haas (#74)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have created some test to cover partition wise joins with
postgres_fdw, also verified make check.
patch attached.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

postgres_fdw_pwj.patchtext/x-patch; charset=US-ASCII; name=postgres_fdw_pwj.patch
#76Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#70)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 10:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On a further testing of this patch I find another case when it is
showing regression, the time taken with patch is around 160 secs and
without it is 125 secs.
Another minor thing to note that is planning time is almost twice with
this patch, though I understand that this is for scenarios with really
big 'big data' so this may not be a serious issue in such cases, but
it'd be good if we can keep an eye on this that it doesn't exceed the
computational bounds for a really large number of tables.

Right, planning time would be proportional to the number of partitions
at least in the first version. We may improve upon it later.

Please find the attached .out file to check the output I witnessed and
let me know if anymore information is required
Schema and data was similar to the preciously shared schema with the
addition of more data for this case, parameter settings used were:
work_mem = 1GB
random_page_cost = seq_page_cost = 0.1

this doesn't look good. Why do you set both these costs to the same value?

The patch does not introduce any new costing model. It costs the
partition-wise join as sum of costs of joins between partitions. The
method to create the paths for joins between partitions is same as
creating the paths for joins between regular tables and then the
method to collect paths across partition-wise joins is same as
collecting paths across child base relations. So, there is a large
chance that the costing for joins between partitions might have a
problem which is showing up here. There may be some special handling
for regular tables versus child tables that may be the root cause. But
I have not seen that kind of code till now.

Can you please provide the outputs of individual partition-joins? If
the plans for joins between partitions are same as the ones chosen for
partition-wise joins, we may need to fix the existing join cost
models.

Offlist, Rafia shared the outputs of joins between partitions and join
between partitioned table. The joins between partitions look similar
to those pick up by the partition-wise join. So, it looks that some
costing error in regular joins is resulting in an costing error in
partition-wise join as suspected. Attached the SQL and the output.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

queries.outapplication/octet-stream; name=queries.out
queries.sqlapplication/octet-stream; name=queries.sql
#77Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#74)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

That seems different than what I suggested and I'm not sure what the
reason is for the difference?

The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
the code to use it will look quite odd by itself. We are not changing
all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
those. There is code which needs to check those kinds, instead of "all
join rels" or "all other rels" resp. So the patch will add those
macros, change only few places to use those macros, which are intended
to be changed while applying partition-wise join support for single
level partitioned table.

Hmm. You might be right, but I'm not convinced.

Ok. changed as per your request in the latest set of patches.

There are some more changes as follows
1. In the earlier patch set the changes related to
calc_nestloop_required_outer() and related functions were spread
across multiple patches. That was unintentional. This patch set has
all those changes in a single patch.

2. Rajkumar reported a crash offlist. When one of the joining
multi-level partitioned relations is empty, an assertion in
try_partition_wise_join() Assert(rel1->part_rels && rel2->part_rels);
failed since it didn't find part_rels for a subpartition. The problem
here is set_append_rel_size() does not call set_rel_size() and hence
set_append_rel_size() if a child is found to be empty, a scenario
described in [1]CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com, subject "Asymmetry between parent and child wrt "false" quals". It's the later one which sets the part_rels for a
partitioned relation and hence the subpartitions do not get part_rels
since set_append_rel_size() is never called for those. Generally, if a
partitioned relation is found to be empty before we set part_rels, we
may not want to spend time in creating/collecting child RelOptInfos,
since they will be empty anyway. If part_rels isn't present,
part_scheme doesn't make sense. So an empty partitioned table without
any partitions can be treated as unpartitioned. So, I have fixed
set_dummy_rel_pathlist() and mark_dummy_rel(), the functions setting a
relation empty, to reset partition scheme when those conditions are
met. This fix is included as a separate patch. Let me know if this
looks good to you.

3. I am in the process of completing reparameterize_paths_by_child()
by adding all possible paths. I have restructured the function to look
better and have one switch case instead of two. Also added more path
types including ForeignPath, for which I have added a FDW hook, with
documentation, for handling fdw_private. Please let me know if this
looks good to you. I am thinking of similar hook for CustomPath. I
will continue to add more path types to
reparameterize_path_by_child().

I am wondering whether we should bring 0007 he patche adjusting code
to work with child-joins before 0006, partition-wise join. 0006 needs
it, but 0007 doesn't depend upon 0006. Will that be any better?

[1]: CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com, subject "Asymmetry between parent and child wrt "false" quals"
subject "Asymmetry between parent and child wrt "false" quals"

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v9.zipapplication/zip; name=pg_dp_join_patches_v9.zip
#78Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#75)
Re: Partition-wise join for join between (declaratively) partitioned tables

Thanks Rajkumar. Added those in the latest set of patches.

On Tue, Mar 21, 2017 at 3:52 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have created some test to cover partition wise joins with
postgres_fdw, also verified make check.
patch attached.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#76)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 21, 2017 at 7:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Mar 20, 2017 at 10:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On a further testing of this patch I find another case when it is
showing regression, the time taken with patch is around 160 secs and
without it is 125 secs.
Another minor thing to note that is planning time is almost twice with
this patch, though I understand that this is for scenarios with really
big 'big data' so this may not be a serious issue in such cases, but
it'd be good if we can keep an eye on this that it doesn't exceed the
computational bounds for a really large number of tables.

Right, planning time would be proportional to the number of partitions
at least in the first version. We may improve upon it later.

Please find the attached .out file to check the output I witnessed and
let me know if anymore information is required
Schema and data was similar to the preciously shared schema with the
addition of more data for this case, parameter settings used were:
work_mem = 1GB
random_page_cost = seq_page_cost = 0.1

this doesn't look good. Why do you set both these costs to the same value?

That's a perfectly reasonable configuration if the data is in memory
on a medium with fast random access, like an SSD.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#77)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Here's the set of patches rebased on latest head, which also has a
commit to eliminate scans on partitioned tables. This change has
caused problems with multi-level partitioned tables, that I have not
fixed in this patch set. Also a couple of partition-wise join plans
for single-level partitioned tables have changed to non-partition-wise
joins. I haven't fixed those as well.

I have added a separate patch to fix add_paths_to_append_rel() to
collect partitioned_rels list for join relations. Please let me know
if this looks good. I think it needs to be merged into some other
patch, but I am not sure which. Probably we should just treat it as
another refactoring patch.

On Tue, Mar 21, 2017 at 5:16 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

That seems different than what I suggested and I'm not sure what the
reason is for the difference?

The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
the code to use it will look quite odd by itself. We are not changing
all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
those. There is code which needs to check those kinds, instead of "all
join rels" or "all other rels" resp. So the patch will add those
macros, change only few places to use those macros, which are intended
to be changed while applying partition-wise join support for single
level partitioned table.

Hmm. You might be right, but I'm not convinced.

Ok. changed as per your request in the latest set of patches.

There are some more changes as follows
1. In the earlier patch set the changes related to
calc_nestloop_required_outer() and related functions were spread
across multiple patches. That was unintentional. This patch set has
all those changes in a single patch.

2. Rajkumar reported a crash offlist. When one of the joining
multi-level partitioned relations is empty, an assertion in
try_partition_wise_join() Assert(rel1->part_rels && rel2->part_rels);
failed since it didn't find part_rels for a subpartition. The problem
here is set_append_rel_size() does not call set_rel_size() and hence
set_append_rel_size() if a child is found to be empty, a scenario
described in [1]. It's the later one which sets the part_rels for a
partitioned relation and hence the subpartitions do not get part_rels
since set_append_rel_size() is never called for those. Generally, if a
partitioned relation is found to be empty before we set part_rels, we
may not want to spend time in creating/collecting child RelOptInfos,
since they will be empty anyway. If part_rels isn't present,
part_scheme doesn't make sense. So an empty partitioned table without
any partitions can be treated as unpartitioned. So, I have fixed
set_dummy_rel_pathlist() and mark_dummy_rel(), the functions setting a
relation empty, to reset partition scheme when those conditions are
met. This fix is included as a separate patch. Let me know if this
looks good to you.

3. I am in the process of completing reparameterize_paths_by_child()
by adding all possible paths. I have restructured the function to look
better and have one switch case instead of two. Also added more path
types including ForeignPath, for which I have added a FDW hook, with
documentation, for handling fdw_private. Please let me know if this
looks good to you. I am thinking of similar hook for CustomPath. I
will continue to add more path types to
reparameterize_path_by_child().

I am wondering whether we should bring 0007 he patche adjusting code
to work with child-joins before 0006, partition-wise join. 0006 needs
it, but 0007 doesn't depend upon 0006. Will that be any better?

[1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com,
subject "Asymmetry between parent and child wrt "false" quals"

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v11.zipapplication/zip; name=pg_dp_join_patches_v11.zip
#81Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#80)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 21, 2017 at 10:40 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's the set of patches rebased on latest head, which also has a
commit to eliminate scans on partitioned tables. This change has
caused problems with multi-level partitioned tables, that I have not
fixed in this patch set. Also a couple of partition-wise join plans
for single-level partitioned tables have changed to non-partition-wise
joins. I haven't fixed those as well.

I have added a separate patch to fix add_paths_to_append_rel() to
collect partitioned_rels list for join relations. Please let me know
if this looks good. I think it needs to be merged into some other
patch, but I am not sure which. Probably we should just treat it as
another refactoring patch.

On Tue, Mar 21, 2017 at 5:16 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Mar 20, 2017 at 11:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 20, 2017 at 1:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

That seems different than what I suggested and I'm not sure what the
reason is for the difference?

The patch adding macros IS_JOIN_REL() and IS_OTHER_REL() and changing
the code to use it will look quite odd by itself. We are not changing
all the instances of RELOPT_JOINREL or RELOPT_OTHER_MEMBER_REL to use
those. There is code which needs to check those kinds, instead of "all
join rels" or "all other rels" resp. So the patch will add those
macros, change only few places to use those macros, which are intended
to be changed while applying partition-wise join support for single
level partitioned table.

Hmm. You might be right, but I'm not convinced.

Ok. changed as per your request in the latest set of patches.

There are some more changes as follows
1. In the earlier patch set the changes related to
calc_nestloop_required_outer() and related functions were spread
across multiple patches. That was unintentional. This patch set has
all those changes in a single patch.

2. Rajkumar reported a crash offlist. When one of the joining
multi-level partitioned relations is empty, an assertion in
try_partition_wise_join() Assert(rel1->part_rels && rel2->part_rels);
failed since it didn't find part_rels for a subpartition. The problem
here is set_append_rel_size() does not call set_rel_size() and hence
set_append_rel_size() if a child is found to be empty, a scenario
described in [1]. It's the later one which sets the part_rels for a
partitioned relation and hence the subpartitions do not get part_rels
since set_append_rel_size() is never called for those. Generally, if a
partitioned relation is found to be empty before we set part_rels, we
may not want to spend time in creating/collecting child RelOptInfos,
since they will be empty anyway. If part_rels isn't present,
part_scheme doesn't make sense. So an empty partitioned table without
any partitions can be treated as unpartitioned. So, I have fixed
set_dummy_rel_pathlist() and mark_dummy_rel(), the functions setting a
relation empty, to reset partition scheme when those conditions are
met. This fix is included as a separate patch. Let me know if this
looks good to you.

3. I am in the process of completing reparameterize_paths_by_child()
by adding all possible paths. I have restructured the function to look
better and have one switch case instead of two. Also added more path
types including ForeignPath, for which I have added a FDW hook, with
documentation, for handling fdw_private. Please let me know if this
looks good to you. I am thinking of similar hook for CustomPath. I
will continue to add more path types to
reparameterize_path_by_child().

I am wondering whether we should bring 0007 he patche adjusting code
to work with child-joins before 0006, partition-wise join. 0006 needs
it, but 0007 doesn't depend upon 0006. Will that be any better?

[1] CAFjFpRcdrdsCRDbBu0J2pxwWbhb_sDWQUTVznBy_4XGr-p3+wA@mail.gmail.com,
subject "Asymmetry between parent and child wrt "false" quals"

In an attempt to test the geqo side of this patch, I reduced
geqo_threshold to 6 and set enable_partitionwise_join to to true and
tried following query, which crashed,

explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a
= prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000
order by prt.a desc;

Stack-trace for the crash is as follows,

Program received signal SIGSEGV, Segmentation fault.
0x00000000007a43d1 in find_param_path_info (rel=0x2d3fe30,
required_outer=0x2ff6d30) at relnode.c:1534
1534 if (bms_equal(ppi->ppi_req_outer, required_outer))
(gdb) bt
#0 0x00000000007a43d1 in find_param_path_info (rel=0x2d3fe30,
required_outer=0x2ff6d30) at relnode.c:1534
#1 0x000000000079b8bb in reparameterize_path_by_child
(root=0x2df7550, path=0x2f6dec0, child_rel=0x2d4a860) at
pathnode.c:3455
#2 0x000000000075be30 in try_nestloop_path (root=0x2df7550,
joinrel=0x2ff51b0, outer_path=0x2f96540, inner_path=0x2f6dec0,
pathkeys=0x0,
jointype=JOIN_INNER, extra=0x7fffe6b4e130) at joinpath.c:344
#3 0x000000000075d55b in match_unsorted_outer (root=0x2df7550,
joinrel=0x2ff51b0, outerrel=0x2d4a860, innerrel=0x2d3fe30,
jointype=JOIN_INNER,
extra=0x7fffe6b4e130) at joinpath.c:1389
#4 0x000000000075bc5f in add_paths_to_joinrel (root=0x2df7550,
joinrel=0x2ff51b0, outerrel=0x2d4a860, innerrel=0x2d3fe30,
jointype=JOIN_INNER,
sjinfo=0x3076bc8, restrictlist=0x3077168) at joinpath.c:234
#5 0x000000000075f1d5 in populate_joinrel_with_paths (root=0x2df7550,
rel1=0x2d3fe30, rel2=0x2d4a860, joinrel=0x2ff51b0, sjinfo=0x3076bc8,
restrictlist=0x3077168) at joinrels.c:793
#6 0x0000000000760107 in try_partition_wise_join (root=0x2df7550,
rel1=0x2d3f6d8, rel2=0x2d4a1a0, joinrel=0x30752f0,
parent_sjinfo=0x7fffe6b4e2d0,
parent_restrictlist=0x3075768) at joinrels.c:1401
#7 0x000000000075f0e6 in make_join_rel (root=0x2df7550,
rel1=0x2d3f6d8, rel2=0x2d4a1a0) at joinrels.c:744
#8 0x0000000000742053 in merge_clump (root=0x2df7550,
clumps=0x3075270, new_clump=0x30752a8, force=0 '\000') at
geqo_eval.c:260
#9 0x0000000000741f1c in gimme_tree (root=0x2df7550, tour=0x2ff2430,
num_gene=6) at geqo_eval.c:199
#10 0x0000000000741df5 in geqo_eval (root=0x2df7550, tour=0x2ff2430,
num_gene=6) at geqo_eval.c:102
#11 0x000000000074288a in random_init_pool (root=0x2df7550,
pool=0x2ff23d0) at geqo_pool.c:109
#12 0x00000000007422a6 in geqo (root=0x2df7550, number_of_rels=6,
initial_rels=0x2ff22d0) at geqo_main.c:114
#13 0x0000000000747f19 in make_rel_from_joinlist (root=0x2df7550,
joinlist=0x2dce940) at allpaths.c:2333
#14 0x0000000000744e7e in make_one_rel (root=0x2df7550,
joinlist=0x2dce940) at allpaths.c:182
#15 0x0000000000772df9 in query_planner (root=0x2df7550,
tlist=0x2dec2c0, qp_callback=0x777ce1 <standard_qp_callback>,
qp_extra=0x7fffe6b4e700)
at planmain.c:254

Please let me know if any more information is required on this.
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#81)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

In an attempt to test the geqo side of this patch, I reduced
geqo_threshold to 6 and set enable_partitionwise_join to to true and
tried following query, which crashed,

explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a
= prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000
order by prt.a desc;

Stack-trace for the crash is as follows,

Nice catch. When reparameterize_path_by_child() may be running in a
temporary memory context while running in GEQO mode. It may add a new
PPI to base relation all in the temporary context. In the next GEQO
cycle, the ppilist will be clobbered since the temporary context is
reset for each geqo cycle. The fix is to allocate PPI in the same
memory context as the RelOptInfo similar to mark_dummy_rel().

I also found another problem. In geqo, we never call
generate_partition_wise_join_paths() which set cheapest paths for each
child-join. Because of this cheapest_*_paths are never set for those
rels, thus segfaulting in functions like sort_inner_and_outer() which
use those.

Here's patch fixing both the issues. Please let me know if it fixes
the issues you are seeing.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

geqo_fix.patchapplication/octet-stream; name=geqo_fix.patch
#83Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#82)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 22, 2017 at 3:19 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

In an attempt to test the geqo side of this patch, I reduced
geqo_threshold to 6 and set enable_partitionwise_join to to true and
tried following query, which crashed,

explain select * from prt, prt2, prt3, prt32, prt4, prt42 where prt.a
= prt2.b and prt3.a = prt32.b and prt4.a = prt42.b and prt2.a > 1000
order by prt.a desc;

Stack-trace for the crash is as follows,

Nice catch. When reparameterize_path_by_child() may be running in a
temporary memory context while running in GEQO mode. It may add a new
PPI to base relation all in the temporary context. In the next GEQO
cycle, the ppilist will be clobbered since the temporary context is
reset for each geqo cycle. The fix is to allocate PPI in the same
memory context as the RelOptInfo similar to mark_dummy_rel().

I also found another problem. In geqo, we never call
generate_partition_wise_join_paths() which set cheapest paths for each
child-join. Because of this cheapest_*_paths are never set for those
rels, thus segfaulting in functions like sort_inner_and_outer() which
use those.

Here's patch fixing both the issues. Please let me know if it fixes
the issues you are seeing.

I tested the applied patch, it is fixing the reported issue.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#84Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#82)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Here's set of updated patches rebased on
1148e22a82edc96172fc78855da392b6f0015c88.

I have fixed all the issues reported till now.

I have also completed reparameterize_path_by_child() for all the
required paths. There's no TODO there now. :) The function has grown
quite long now and might take some time to review. Given the size, I
am wondering whether we should separate that fix from the main
partition-wise join fix. That will make reviewing that function
easier, allowing a careful review. Here's the idea how that can be
done. As explained in the commit of 0009, the function is required in
case of lateral joins between partitioned relations. For a A LATERAL
JOIN B, B is the minimum required parameterization by A. Hence
children of A i.e. A1, A2 ... all require their paths to be
parameterized by B. When that comes to partition-wise joins, A1
requires its paths to be parameterized by B1 (matching partition from
B). Otherwise we can not create paths for A1B1. This means that we
require to reparameterize all A1's paths to be reparameterized by B1
using function reparameterize_paths_by_child(). So the function needs
to support reparameterization of all the paths; we do not know which
of those have survived add_path(). But if we disable partition-wise
join for lateral joins i.e. when direct_lateral_relids of one relation
contains the any subset of the relids in the other relation, we do not
need reparameterize_path_by_child(). Please let me know if this
strategy will help to make review and commit easier.

After the commit,
commit d3cc37f1d801a6b5cad9bf179274a8d767f1ee50
Author: Robert Haas <rhaas@postgresql.org>
Date: Tue Mar 21 09:48:04 2017 -0400

Don't scan partitioned tables.

We do not create any AppendRelInfos and hence RelOptInfos for the
partitioned tables. My approach to attach multi-level partitioned join
was to store RelOptInfos of immediate partitions in part_rels of
RelOptInfo of a partitioned table, thus maintaining a tree of
RelOptInfos reflecting partitioning tree. This allows to add append
paths to intermediate RelOptInfos, flattening them as we go up the
partitioning hierarchy. With no RelOptInfos for intermediate
partitions, we can support multi-level partition-wise join only in
limited cases when the partitioning hierarchy of the joining table
exactly matches. Please refer [1]/messages/by-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV=kB8Akkqg@mail.gmail.com for some more discussion.

I think we need the RelOptInfos for the partitions, which are
partitioned to hold the "append" paths containing paths from their
children and to match the partitions in partition-wise join. Similar
hierarchy will be created for partitioned joins, with partitioned join
partitions. So, I have not changed the multi-level partition-wise join
support patches. After applying 0011-0013 the multi-level partitioning
tests fail with error "could not find the RelOptInfo of a partition
with oid", since it does not find the RelOptInfos of partitions which
are partitioned.

[1]: /messages/by-id/CAFjFpRceMmx26653XFAYvc5KVQcrzcKScVFqZdbXV=kB8Akkqg@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v12.zipapplication/zip; name=pg_dp_join_patches_v12.zip
#85Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#83)
Re: Partition-wise join for join between (declaratively) partitioned tables

I tested the applied patch, it is fixing the reported issue.

Thanks for the confirmation Rafia. I have included the fix in the
latest set of patches.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#86Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#84)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have also completed reparameterize_path_by_child() for all the
required paths. There's no TODO there now. :) The function has grown
quite long now and might take some time to review. Given the size, I
am wondering whether we should separate that fix from the main
partition-wise join fix. That will make reviewing that function
easier, allowing a careful review. Here's the idea how that can be
done. As explained in the commit of 0009, the function is required in
case of lateral joins between partitioned relations. For a A LATERAL
JOIN B, B is the minimum required parameterization by A. Hence
children of A i.e. A1, A2 ... all require their paths to be
parameterized by B. When that comes to partition-wise joins, A1
requires its paths to be parameterized by B1 (matching partition from
B). Otherwise we can not create paths for A1B1. This means that we
require to reparameterize all A1's paths to be reparameterized by B1
using function reparameterize_paths_by_child(). So the function needs
to support reparameterization of all the paths; we do not know which
of those have survived add_path(). But if we disable partition-wise
join for lateral joins i.e. when direct_lateral_relids of one relation
contains the any subset of the relids in the other relation, we do not
need reparameterize_path_by_child(). Please let me know if this
strategy will help to make review and commit easier.

In my testing last week, reparameterize_path_by_child() was essential
for nested loops to work properly, even without LATERAL. Without it,
the parameterized path ends up containing vars that reference the
parent varno instead of the child varno. That confused later planner
stages so that those Vars did not get replaced with Param during
replace_nestloop_params(), eventually resulting in a crash at
execution time. Based on that experiment, I think we could consider
having reparameterize_path_by_child() handle only scan paths as
reparameterize_path() does, and just give up on plans like this:

Append
-> Left Join
-> Scan on a
-> Inner Join
-> Index Scan on b
-> Index Scan on c
[repeat for each partition]

But I doubt we can get by without it altogether.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#86)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 22, 2017 at 6:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have also completed reparameterize_path_by_child() for all the
required paths. There's no TODO there now. :) The function has grown
quite long now and might take some time to review. Given the size, I
am wondering whether we should separate that fix from the main
partition-wise join fix. That will make reviewing that function
easier, allowing a careful review. Here's the idea how that can be
done. As explained in the commit of 0009, the function is required in
case of lateral joins between partitioned relations. For a A LATERAL
JOIN B, B is the minimum required parameterization by A. Hence
children of A i.e. A1, A2 ... all require their paths to be
parameterized by B. When that comes to partition-wise joins, A1
requires its paths to be parameterized by B1 (matching partition from
B). Otherwise we can not create paths for A1B1. This means that we
require to reparameterize all A1's paths to be reparameterized by B1
using function reparameterize_paths_by_child(). So the function needs
to support reparameterization of all the paths; we do not know which
of those have survived add_path(). But if we disable partition-wise
join for lateral joins i.e. when direct_lateral_relids of one relation
contains the any subset of the relids in the other relation, we do not
need reparameterize_path_by_child(). Please let me know if this
strategy will help to make review and commit easier.

In my testing last week, reparameterize_path_by_child() was essential
for nested loops to work properly, even without LATERAL. Without it,
the parameterized path ends up containing vars that reference the
parent varno instead of the child varno. That confused later planner
stages so that those Vars did not get replaced with Param during
replace_nestloop_params(), eventually resulting in a crash at
execution time.

I half-described the solution. Sorry. Along-with disabling
partition-wise lateral joins, we have to disable nested loop
child-joins where inner child is parameterized by the parent of the
outer one. We will still have nestloop join between parents where
inner relation is parameterized by the outer and every child of inner
is parameterized by the outer. But we won't create nest loop joins
where inner child is parameterized by the outer child, where we
require reparameterize_path_by_child. We will loose this optimization
only till we get reparameterize_path_by_child() committed. Basically,
in try_nestloop_path() (in the patch 0009), if
(PATH_PARAM_BY_PARENT(inner_path, outer_path->parent)), give up
creating nest loop path. That shouldn't create any problems.

Did you experiment with this change in try_nestloop_path()? Can you
please share the testcase? I will take a look at it.

Based on that experiment, I think we could consider
having reparameterize_path_by_child() handle only scan paths as
reparameterize_path() does, and just give up on plans like this:

Append
-> Left Join
-> Scan on a
-> Inner Join
-> Index Scan on b
-> Index Scan on c
[repeat for each partition]

I am assuming that a, b and c are partitions of A, B and C resp. which
are being joined and both or one of the scans on b and c are
parameteried by a or scan of c is parameterized by b.

I don't think we will get away by supporting just scan paths, since
the inner side of lateral join can be any paths not just scan path. Or
you are suggesting that we disable partition-wise lateral join and
support reparameterization of only scan paths?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#88Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#87)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 22, 2017 at 9:59 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

In my testing last week, reparameterize_path_by_child() was essential
for nested loops to work properly, even without LATERAL. Without it,
the parameterized path ends up containing vars that reference the
parent varno instead of the child varno. That confused later planner
stages so that those Vars did not get replaced with Param during
replace_nestloop_params(), eventually resulting in a crash at
execution time.

I half-described the solution. Sorry. Along-with disabling
partition-wise lateral joins, we have to disable nested loop
child-joins where inner child is parameterized by the parent of the
outer one. We will still have nestloop join between parents where
inner relation is parameterized by the outer and every child of inner
is parameterized by the outer. But we won't create nest loop joins
where inner child is parameterized by the outer child, where we
require reparameterize_path_by_child. We will loose this optimization
only till we get reparameterize_path_by_child() committed. Basically,
in try_nestloop_path() (in the patch 0009), if
(PATH_PARAM_BY_PARENT(inner_path, outer_path->parent)), give up
creating nest loop path. That shouldn't create any problems.

Did you experiment with this change in try_nestloop_path()? Can you
please share the testcase? I will take a look at it.

I didn't save the test case. It was basically just forcing a
partitionwise nestloop join between two equipartitioned tables, with
the calls to adjust_appendrel_attrs() ripped out of
reparameterize_path_by_child(), just to see what would break.

Based on that experiment, I think we could consider
having reparameterize_path_by_child() handle only scan paths as
reparameterize_path() does, and just give up on plans like this:

Append
-> Left Join
-> Scan on a
-> Inner Join
-> Index Scan on b
-> Index Scan on c
[repeat for each partition]

I am assuming that a, b and c are partitions of A, B and C resp. which
are being joined and both or one of the scans on b and c are
parameteried by a or scan of c is parameterized by b.

Yes.

I don't think we will get away by supporting just scan paths, since
the inner side of lateral join can be any paths not just scan path. Or
you are suggesting that we disable partition-wise lateral join and
support reparameterization of only scan paths?

I think if you can do a straight-up partitionwise nested loop between
two tables A and B, that's pretty bad. But if there are more complex
cases that involve parameterizing entire join trees which aren't
covered, that's less bad. Parallel query almost entirely punts on
LATERAL right now, and nobody's complained yet. I'm sure that'll need
to get fixed someday, but not today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#89Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#84)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's set of updated patches rebased on
1148e22a82edc96172fc78855da392b6f0015c88.

I have fixed all the issues reported till now.

I don't understand why patch 0001 ends up changing every existing test
for RELOPT_JOINREL anywhere in the source tree to use IS_JOIN_REL(),
yet none of the existing tests for RELOPT_OTHER_MEMBER_REL end up
getting changed to use IS_OTHER_REL(). That's very surprising. Some
of those tests are essentially checking for something that is going to
have a scan plan rather than a join or upper plan, and those tests
probably don't need to be modified; for example, the test in
set_rel_consider_parallel() is obviously of this type. But others are
testing whether we've got some kind of child rel, and those seem like
they might need work. Going through a few specific examples:

- generate_join_implied_equalities_for_ecs() assumes that any child
rel is an other member rel.
- generate_join_implied_equalities_broken() assumes that any child rel
is an other member rel.
- generate_implied_equalities_for_column() set is_child_rel on the
assumption that only an other member rel can be a child rel.
- eclass_useful_for_merging() assumes that the only kind of child rel
is an other member rel.
- find_childrel_appendrelinfo() assumes that any child rel is an other
member rel.
- find_childrel_top_parent() and find_childrel_parents() assume that
children must be other member rels and their parents must be baserels.
- adjust_appendrel_attrs_multilevel() assumes that children must be
other member rels and their parents must be baserels.

It's possible that, for various reasons, none of these code paths
would ever be reachable by a child join, but it doesn't look likely to
me. And even if that's true, some comment updates are probably
needed, and maybe some renaming of functions too.

In postgres_fdw, get_useful_ecs_for_relation() assumes that any child
rel is an other member rel. I'm not sure if we're hoping that
partitionwise join will work with postgres_fdw's join pushdown out of
the chute, but clearly this would need to be adjusted to have any
chance of being right.

Some that seem OK:

- set_rel_consider_parallel() is fine.
- set_append_rel_size() is only going to be called for baserels or
their children, so it's fine.
- relation_excluded_by_constraints() is only intended to be called on
baserels or their children, so it's fine.
- check_index_predicates() is only intended to be called on baserels
or their children, so it's fine.
- query_planner() loops over baserels and their children, so it's fine.

Perhaps we could introduce an IS_BASEREL_OR_CHILD() test that could be
used in some of these places, just for symmetry. The point is that
there are really three questions here: (1) is it some kind of baserel
(parent or child)? (2) is it some kind of joinrel (parent or child)?
and (3) is it some kind of child (baserel or join)? Right now, both
#2 and #3 are tested by just comparing against
RELOPT_OTHER_MEMBER_REL, but they become different tests as soon as we
add child joinrels. The goal of 0001, IMV, ought to be to try to
figure out which of #1, #2, and #3 is being checked in each case and
make that clear via use of an appropriate macro. (If is-other-baserel
is the real test, then fine, but I bet that's a rare case.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#90Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#88)
Re: Partition-wise join for join between (declaratively) partitioned tables

I don't think we will get away by supporting just scan paths, since
the inner side of lateral join can be any paths not just scan path. Or
you are suggesting that we disable partition-wise lateral join and
support reparameterization of only scan paths?

I think if you can do a straight-up partitionwise nested loop between
two tables A and B, that's pretty bad.

Ok.

But if there are more complex
cases that involve parameterizing entire join trees which aren't
covered, that's less bad. Parallel query almost entirely punts on
LATERAL right now, and nobody's complained yet. I'm sure that'll need
to get fixed someday, but not today.

Ok.

I am suggesting this possibility in case we run of time to review and
commit reparameterize_path_by_child() entirely. If we can do that, I
will be happy.

In case, we have to include a stripped down version of
reparameterize_path_by_child(), with which I am fine too, we will need
to disable LATERAL joins, so that we don't end up with an error "could
not devise a query plan for the given query".

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#91Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#89)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 23, 2017 at 1:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's set of updated patches rebased on
1148e22a82edc96172fc78855da392b6f0015c88.

I have fixed all the issues reported till now.

I don't understand why patch 0001 ends up changing every existing test
for RELOPT_JOINREL anywhere in the source tree to use IS_JOIN_REL(),
yet none of the existing tests for RELOPT_OTHER_MEMBER_REL end up
getting changed to use IS_OTHER_REL(). That's very surprising. Some
of those tests are essentially checking for something that is going to
have a scan plan rather than a join or upper plan, and those tests
probably don't need to be modified; for example, the test in
set_rel_consider_parallel() is obviously of this type. But others are
testing whether we've got some kind of child rel, and those seem like
they might need work. Going through a few specific examples:

- generate_join_implied_equalities_for_ecs() assumes that any child
rel is an other member rel.
- generate_join_implied_equalities_broken() assumes that any child rel
is an other member rel.

Fixed those.

- generate_implied_equalities_for_column() set is_child_rel on the
assumption that only an other member rel can be a child rel.

This function is called for indexes, which are not defined on the
join relations. So, we shouldn't worry about child-joins here. I have
added an assertion in there to make sure that that function gets
called only for base and "other" member rels.

- eclass_useful_for_merging() assumes that the only kind of child rel
is an other member rel.

This was being fixed in a later patch which had many small fixes for
handling child-joins. But now I have moved that fix into 0001.

- find_childrel_appendrelinfo() assumes that any child rel is an other
member rel.

The function is called for "other" member relation only. For joins we
use find_appinfos_by_relids() We could replace
find_childrel_appendrelinfo() with find_appinfos_by_relids(), which
does same thing as find_childrel_appendrelinfo() for a relids set. But
find_appinfos_by_relids() returns a list of AppendRelInfos, hence
using it instead of find_childrel_appendrelinfo() will spend some
memory and CPU cycles in creating a one element list and then
extracting that element out of the list. So, I have not replaced
usages of find_childrel_appendrelinfo() with
find_appinfos_by_relids(). This also simplifies changes to
get_useful_ecs_for_relation().

- find_childrel_top_parent() and find_childrel_parents() assume that
children must be other member rels and their parents must be baserels.

For partition-wise join implementation we save relids of topmost
parent in RelOptInfo of child. We can directly use that instead of
calling find_childrel_top_parent(). So, in 0001 I am adding
top_parent_relids in RelOptInfo and getting rid of
find_childrel_top_parent(). This also fixes
get_useful_ecs_for_relation() in a better way. find_childrel_parents()
is called only for simple relations not joins, since it's callers are
called only for simple relations. I have added an assertion to that
effect.

- adjust_appendrel_attrs_multilevel() assumes that children must be
other member rels and their parents must be baserels.

It was being fixed in a later patch. In the attached patch set 0001
changes it to use IS_OTHER_REL().

It's possible that, for various reasons, none of these code paths
would ever be reachable by a child join, but it doesn't look likely to
me. And even if that's true, some comment updates are probably
needed, and maybe some renaming of functions too.

Now commit messages of 0001 explains which instances of
RELOPT_OTHER_MEMBER_REL and RELOPT_BASEREL have been changed, and
which have been retained and why. Also, added assertions wherever
necessary.

In postgres_fdw, get_useful_ecs_for_relation() assumes that any child
rel is an other member rel. I'm not sure if we're hoping that
partitionwise join will work with postgres_fdw's join pushdown out of
the chute, but clearly this would need to be adjusted to have any
chance of being right.

Fixed this as explained above.

Some that seem OK:

- set_rel_consider_parallel() is fine.
- set_append_rel_size() is only going to be called for baserels or
their children, so it's fine.
- relation_excluded_by_constraints() is only intended to be called on
baserels or their children, so it's fine.
- check_index_predicates() is only intended to be called on baserels
or their children, so it's fine.
- query_planner() loops over baserels and their children, so it's fine.

Right.

Perhaps we could introduce an IS_BASEREL_OR_CHILD() test that could be
used in some of these places, just for symmetry.

I was wondering about this as well. Although, I though it better not
to touch base relations in partition-wise join. But now, I have added
that macro and adjusted corresponding tests in the code. See 0001.

You may actually want to squash 0001 and 0002 into a single patch. But
for now, I have left those as separate.

The point is that
there are really three questions here: (1) is it some kind of baserel
(parent or child)? (2) is it some kind of joinrel (parent or child)?
and (3) is it some kind of child (baserel or join)? Right now, both
#2 and #3 are tested by just comparing against
RELOPT_OTHER_MEMBER_REL, but they become different tests as soon as we
add child joinrels. The goal of 0001, IMV, ought to be to try to
figure out which of #1, #2, and #3 is being checked in each case and
make that clear via use of an appropriate macro. (If is-other-baserel
is the real test, then fine, but I bet that's a rare case.)

Agreed. I have gone through all the cases, and fixed the necessary
ones as explained above and in the commit messages of 0001.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v13.zipapplication/zip; name=pg_dp_join_patches_v13.zip
#92Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#91)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hi Ashutosh,

On 2017/03/23 21:48, Ashutosh Bapat wrote:

I have fixed all the issues reported till now.

In patch 0007, the following code in have_partkey_equi_join() looks
potentially unsafe:

/*
* If the clause refers to different partition keys from
* both relations, it can not be used for partition-wise join.
*/
if (ipk1 != ipk2)
continue;

/*
* The clause allows partition-wise join if only it uses the same
* operator family as that specified by the partition key.
*/
if (!list_member_oid(rinfo->mergeopfamilies,
part_scheme->partopfamily[ipk1]))
continue;

What if ipk1 and ipk2 both turn out to be -1? Accessing
part_schem->partopfamily[ipk1] would be incorrect, no?

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#93Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#91)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hi Ashutosh,

On 2017/03/23 21:48, Ashutosh Bapat wrote:

I have fixed all the issues reported till now.

I've tried to fix your 0012 patch (Multi-level partitioned table
expansion) considering your message earlier on this thread [1]/messages/by-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com.
Especially the fact that no AppendRelInfo and RelOptInfo are allocated for
partitioned child tables as of commit d3cc37f1d [2]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d3cc37f1d. I've fixed
expand_inherited_rtentry() such that AppendRelInfo *is* allocated for a
partitioned child RTEs whose rte->inh is set to true. Such an RTE is
recursively expanded with that RTE the parent.

Also as I mentioned elsewhere [3]/messages/by-id/744d20fe-fc7b-f89e-8d06-6496ec537b86@lab.ntt.co.jp, the multi-level inheritance expansion
of partitioned table will break update/delete for partitioned table, which
is because inheritance_planner() is not ready to handle inheritance sets
structured that way. I tried to refactor inheritance_planner() such that
its core logic can be recursively invoked for partitioned child RTEs. The
resulting child paths and other auxiliary information related to planning
across the hierarchy are maintained in one place using a struct to hold
the same in a few flat lists. The refactoring didn't break any existing
tests and a couple of new tests are added to check that it indeed works
for multi-level partitioned tables expanded using new multi-level structure.

There is some test failure in 0014 (Multi-level partition-wise join
tests), probably because of the changes I made to 0012, which I didn't get
time to check why, although I've checked using an example that multi-level
join planning still works, so it's not completely broken either.

Thanks,
Amit

[1]: /messages/by-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com
/messages/by-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com

[2]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d3cc37f1d

[3]: /messages/by-id/744d20fe-fc7b-f89e-8d06-6496ec537b86@lab.ntt.co.jp
/messages/by-id/744d20fe-fc7b-f89e-8d06-6496ec537b86@lab.ntt.co.jp

Attachments:

0012-Multi-level-partitioned-table-expansion.patchtext/x-diff; name=0012-Multi-level-partitioned-table-expansion.patch
#94Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#92)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Mar 24, 2017 at 1:57 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Ashutosh,

On 2017/03/23 21:48, Ashutosh Bapat wrote:

I have fixed all the issues reported till now.

In patch 0007, the following code in have_partkey_equi_join() looks
potentially unsafe:

/*
* If the clause refers to different partition keys from
* both relations, it can not be used for partition-wise join.
*/
if (ipk1 != ipk2)
continue;

/*
* The clause allows partition-wise join if only it uses the same
* operator family as that specified by the partition key.
*/
if (!list_member_oid(rinfo->mergeopfamilies,
part_scheme->partopfamily[ipk1]))
continue;

What if ipk1 and ipk2 both turn out to be -1? Accessing
part_schem->partopfamily[ipk1] would be incorrect, no?

Thanks for the report. Surprising this should have crashed sometime,
but didn't ever. Neither it showed wrong output for queries where
partition keys were not part of equi-joins. The reason being
partopfamily[-1] had 0 in it, which when tested again
list_member_oid(rinfo->mergeopfamilies, ..) returned false. Attached
patches fix this code.

Also, I have fixed few grammar mistakes, typos, renamed variables in
PartitionSchemeData to match those in PartitionKey. I have squashed
the patches introducing IS_JOIN_REL, IS_OTHER_REL, IS_SIMPLE_REL into
one.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v14.zipapplication/zip; name=pg_dp_join_patches_v14.zip
#95Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#93)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Mar 24, 2017 at 4:18 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Ashutosh,

On 2017/03/23 21:48, Ashutosh Bapat wrote:

I have fixed all the issues reported till now.

I've tried to fix your 0012 patch (Multi-level partitioned table
expansion) considering your message earlier on this thread [1].
Especially the fact that no AppendRelInfo and RelOptInfo are allocated for
partitioned child tables as of commit d3cc37f1d [2]. I've fixed
expand_inherited_rtentry() such that AppendRelInfo *is* allocated for a
partitioned child RTEs whose rte->inh is set to true. Such an RTE is
recursively expanded with that RTE the parent.

Also as I mentioned elsewhere [3], the multi-level inheritance expansion
of partitioned table will break update/delete for partitioned table, which
is because inheritance_planner() is not ready to handle inheritance sets
structured that way. I tried to refactor inheritance_planner() such that
its core logic can be recursively invoked for partitioned child RTEs. The
resulting child paths and other auxiliary information related to planning
across the hierarchy are maintained in one place using a struct to hold
the same in a few flat lists. The refactoring didn't break any existing
tests and a couple of new tests are added to check that it indeed works
for multi-level partitioned tables expanded using new multi-level structure.

There is some test failure in 0014 (Multi-level partition-wise join
tests), probably because of the changes I made to 0012, which I didn't get
time to check why, although I've checked using an example that multi-level
join planning still works, so it's not completely broken either.

I have gone through the patch, and it looks good to me. Here's the set
of patches with this patch included. Fixed the testcase failures.
Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v15.zipapplication/zip; name=pg_dp_join_patches_v15.zip
#96Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#95)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 27, 2017 at 8:36 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have gone through the patch, and it looks good to me. Here's the set
of patches with this patch included. Fixed the testcase failures.
Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9.

This version of 0001 looks much better to me, but I still have some concerns.

I think we should also introduce IS_UPPER_REL() at the same time, for
symmetry and because partitionwise aggregate will need it, and use it
in place of direct tests against RELOPT_UPPER_REL.

I think it would make sense to change the test in deparseFromExpr() to
check for IS_JOIN_REL() || IS_SIMPLE_REL(). There's no obvious reason
why that shouldn't be OK, and it would remove the last direct test
against RELOPT_JOINREL in the tree, and it will probably need to be
changed for partitionwise aggregate anyway.

Could set_append_rel_size Assert(IS_SIMPLE_REL(rel))? I notice that
you did this in some other places such as
generate_implied_equalities_for_column(), and I like that. If for
some reason that's not going to work, then it's doubtful whether
Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL) is going to
survive either.

Similarly, I think relation_excluded_by_constraints() would also
benefit from Assert(IS_SIMPLE_REL(rel)).

Why not set top_parent_relids earlier, when actually creating the
RelOptInfo? I think you could just change build_simple_rel() so that
instead of passing RelOptKind reloptkind, you instead pass RelOptInfo
*parent. I think postponing that work until set_append_rel_size()
just introduces possible bugs resulting from it not being set early
enough.

Apart from the above, I think 0001 is in good shape.

Regarding 0002, I think the parts that involve factoring out
find_param_path_info() are uncontroversial. Regarding the changes to
adjust_appendrel_attrs(), my main question is whether we wouldn't be
better off using an array representation rather than a List
representation. In other words, this function could take PlannerInfo
*root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing
callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
needing to allocate. To make this work, adjust_child_relids() and
find_appinfos_by_relids() would need to be adjusted to use a similar
argument-passing convention. I suspect this makes iterating over the
AppendRelInfos mildly faster, too, apart from the memory savings.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#97Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#96)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 28, 2017 at 12:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Regarding 0002, I think the parts that involve factoring out
find_param_path_info() are uncontroversial. Regarding the changes to
adjust_appendrel_attrs(), my main question is whether we wouldn't be
better off using an array representation rather than a List
representation. In other words, this function could take PlannerInfo
*root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing
callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
needing to allocate. To make this work, adjust_child_relids() and
find_appinfos_by_relids() would need to be adjusted to use a similar
argument-passing convention. I suspect this makes iterating over the
AppendRelInfos mildly faster, too, apart from the memory savings.

Still regarding 0002, looking at adjust_appendrel_attrs_multilevel,
could we have a common code path for the baserel and joinrel cases?
It seems like perhaps you could just loop over root->append_rel_list.
For each appinfo, if (bms_is_member(appinfo->child_relid,
child_rel->relids)) bms_add_member(parent_relids,
appinfo->parent_relid).

This implementation would have some loss of efficiency in the
single-rel case because we'd scan all of the AppendRelInfos in the
list even if there's only one relid. But you could fix that by
writing it like this:

foreach (lc, root->append_rel_list)
{
if (bms_is_member(appinfo->child_relid, child_rel->relids))
{
bms_add_member(parent_relids, appinfo->parent_relid);
if (child_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
break; /* only one relid to find, and we've found it */
}
}
Assert(bms_num_members(child_rel->relids) == bms_num_members(parent_relids));

That seems pretty slick. It is just as fast as the current
implementation for the single-rel case. It allocates no memory
(unlike what you've got now). And it handles the joinrel case using
essentially the same code as the simple rel case.

In 0003, it seems that it would be more consistent with what you did
elsewhere if the last argument to allow_star_schema_join were named
inner_paramrels rather than innerparams. Other than that, I don't see
anything to complain about.

In 0004:

+                                       Assert(!rel->part_rels[cnt_parts]);
+                                       rel->part_rels[cnt_parts] = childrel;

break here?

+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+                                                       Relation
relation, bool inhparent)
+{
+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !inhparent ||

I still think the inhparent check should be moved to the caller.

In 0005:

+ *             Returns a list of the RT indexes of the partitioned
child relations
+ *             with any of joining relations' rti as the root parent RT index.

I found this wording confusing. Maybe: Build and return a list
containing the RTI of every partitioned relation which is a child of
some rel included in the join.

+ * Note: Only call this function on joins between partitioned tables.

Or what, the boogeyman will come and get you?

(In other words, I don't think that's a very informative comment.)

I don't think 0011 is likely to be acceptable in current form. I
can't imagine that we just went to the trouble of getting rid of
AppendRelInfos for child partitioned rels only to turn around and put
them back again. If you just need the parent-child mappings, you can
get that from the PartitionedChildRelInfo list.

Unfortunately, I don't think we're likely to be able to get this whole
patch series into a committable form in the next few days, but I'd
like to keep reviewing it and working with you on it; there's always
next cycle.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#98Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#96)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Mar 28, 2017 at 10:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 27, 2017 at 8:36 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have gone through the patch, and it looks good to me. Here's the set
of patches with this patch included. Fixed the testcase failures.
Rebased the patchset on de4da168d57de812bb30d359394b7913635d21a9.

This version of 0001 looks much better to me, but I still have some concerns.

I think we should also introduce IS_UPPER_REL() at the same time, for
symmetry and because partitionwise aggregate will need it, and use it
in place of direct tests against RELOPT_UPPER_REL.

Ok. Done. I introduced IS_JOIN_REL and IS_OTHER_REL only to simplify
the tests for child-joins. But now we have grown this patch with
IS_SIMPLE_REL() and IS_UPPER_REL(). That has introduced changes
unrelated to partition-wise join. But I am happy with the way the code
looks now with all IS_*_REL() macros. If we delay this commit, some
more usages of bare RELOPT_* would creep in the code. To avoid that,
we may want to commit these changes in v10.

I think it would make sense to change the test in deparseFromExpr() to
check for IS_JOIN_REL() || IS_SIMPLE_REL(). There's no obvious reason
why that shouldn't be OK, and it would remove the last direct test
against RELOPT_JOINREL in the tree, and it will probably need to be
changed for partitionwise aggregate anyway.

Done. However, we need another assertion to make sure than an "other"
upper rel has an "other" rel as scanrel. That can be added when
partition-wise aggregate, which would introduce "other" upper rels, is
implemented.

Could set_append_rel_size Assert(IS_SIMPLE_REL(rel))? I notice that
you did this in some other places such as
generate_implied_equalities_for_column(), and I like that. If for
some reason that's not going to work, then it's doubtful whether
Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL) is going to
survive either.

Done. Also modified prologue of that function to explicitly say simple
"append relation" since we can have join "append relations" and upper
"append relations" with partition-wise operations.

Similarly, I think relation_excluded_by_constraints() would also
benefit from Assert(IS_SIMPLE_REL(rel)).

Done.

Why not set top_parent_relids earlier, when actually creating the
RelOptInfo? I think you could just change build_simple_rel() so that
instead of passing RelOptKind reloptkind, you instead pass RelOptInfo
*parent. I think postponing that work until set_append_rel_size()
just introduces possible bugs resulting from it not being set early
enough.

Done.

Apart from the above, I think 0001 is in good shape.

Regarding 0002, I think the parts that involve factoring out
find_param_path_info() are uncontroversial. Regarding the changes to
adjust_appendrel_attrs(), my main question is whether we wouldn't be
better off using an array representation rather than a List
representation. In other words, this function could take PlannerInfo
*root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing
callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
needing to allocate. To make this work, adjust_child_relids() and
find_appinfos_by_relids() would need to be adjusted to use a similar
argument-passing convention. I suspect this makes iterating over the
AppendRelInfos mildly faster, too, apart from the memory savings.

Done.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#99Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#97)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 28, 2017 at 12:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Regarding 0002, I think the parts that involve factoring out
find_param_path_info() are uncontroversial. Regarding the changes to
adjust_appendrel_attrs(), my main question is whether we wouldn't be
better off using an array representation rather than a List
representation. In other words, this function could take PlannerInfo
*root, Node *node, int nappinfos, AppendRelInfo **appinfos. Existing
callers doing adjust_appendrel_attrs(root, whatever, appinfo) could
just do adjust_appendrel_attrs(root, whatever, 1, &appinfo), not
needing to allocate. To make this work, adjust_child_relids() and
find_appinfos_by_relids() would need to be adjusted to use a similar
argument-passing convention. I suspect this makes iterating over the
AppendRelInfos mildly faster, too, apart from the memory savings.

Still regarding 0002, looking at adjust_appendrel_attrs_multilevel,
could we have a common code path for the baserel and joinrel cases?
It seems like perhaps you could just loop over root->append_rel_list.
For each appinfo, if (bms_is_member(appinfo->child_relid,
child_rel->relids)) bms_add_member(parent_relids,
appinfo->parent_relid).

This implementation would have some loss of efficiency in the
single-rel case because we'd scan all of the AppendRelInfos in the
list even if there's only one relid. But you could fix that by
writing it like this:

foreach (lc, root->append_rel_list)
{
if (bms_is_member(appinfo->child_relid, child_rel->relids))
{
bms_add_member(parent_relids, appinfo->parent_relid);
if (child_rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
break; /* only one relid to find, and we've found it */
}
}
Assert(bms_num_members(child_rel->relids) == bms_num_members(parent_relids));

That seems pretty slick. It is just as fast as the current
implementation for the single-rel case. It allocates no memory
(unlike what you've got now). And it handles the joinrel case using
essentially the same code as the simple rel case.

I got rid of those differences completely by using trick similar to
adjust_child_relids_multilevel(), which uses top_parent_relids instead
of rel->reloptkind to decide whether we have reached the top parent or
not. Those can trickle down from the topmost caller to any depth in
recursion. This also avoids any call to find_*_rel(), which was the
main reason why we had different code paths for base and join
relation.

In 0003, it seems that it would be more consistent with what you did
elsewhere if the last argument to allow_star_schema_join were named
inner_paramrels rather than innerparams. Other than that, I don't see
anything to complain about.

I had used the same name as the local variable declared with the same
purpose. But this change looks good. Done.

In 0004:

+                                       Assert(!rel->part_rels[cnt_parts]);
+                                       rel->part_rels[cnt_parts] = childrel;

break here?

Right, done.

+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+                                                       Relation
relation, bool inhparent)
+{
+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !inhparent ||

I still think the inhparent check should be moved to the caller.

Done.

In 0005:

+ *             Returns a list of the RT indexes of the partitioned
child relations
+ *             with any of joining relations' rti as the root parent RT index.

I found this wording confusing. Maybe: Build and return a list
containing the RTI of every partitioned relation which is a child of
some rel included in the join.

This is better. Thanks. Done.

+ * Note: Only call this function on joins between partitioned tables.

Or what, the boogeyman will come and get you?

(In other words, I don't think that's a very informative comment.)

I mimicked the prologue of earlier function. I guess, similar comment
in the prologue of earlier function is written because if we use
something other than a partitioned table there, the assertion at the
end of that function would trip. Similarly, for this function, the
assertion at the end of the function will trip, if we use it for
something other than a join relation.

PFA patches rebased on f90d23d0c51895e0d7db7910538e85d3d38691f0.

I don't think 0011 is likely to be acceptable in current form. I
can't imagine that we just went to the trouble of getting rid of
AppendRelInfos for child partitioned rels only to turn around and put
them back again. If you just need the parent-child mappings, you can
get that from the PartitionedChildRelInfo list.

I will reply to this separately.

Unfortunately, I don't think we're likely to be able to get this whole
patch series into a committable form in the next few days, but I'd
like to keep reviewing it and working with you on it; there's always
next cycle.

Thanks for all your efforts in reviewing the patches and for your
excellent suggestions to improve the patches.

As I have stated earlier, it will help if we can get 0001 committed,
may be 0002. 0004 introduces the concept of partitioning scheme which
seems to be vital for partititon-wise aggregation, partition pruning
and may be sorting optimization discussed in [1]. If we are able to
commit it in this commitfest, the patches for those optimizations can
take advantage of partitioning scheme. I understand it's really close
to the end of this commitfest and we may not be able to commit even
those patches.

https://www.mail-archive.com/pgsql-hackers@postgresql.org/msg308742.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v16.zipapplication/zip; name=pg_dp_join_patches_v16.zip
#100Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#97)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think 0011 is likely to be acceptable in current form. I
can't imagine that we just went to the trouble of getting rid of
AppendRelInfos for child partitioned rels only to turn around and put
them back again. If you just need the parent-child mappings, you can
get that from the PartitionedChildRelInfo list.

Please refer to my earlier mails on this subject [1]/messages/by-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com, [2]. For
multi-level partition-wise join, we need RelOptInfo of a partitioned
table to contain RelOptInfo of its immediate partitions. I have not
seen any counter arguments not to create RelOptInfos for intermediate
partitioned tables. We create child RelOptInfos only for entries in
root->append_rel_list i.e. only for those relations which have an
AppendRelInfo. Since we are not creating AppendRelInfos for
partitioned partitions, we do not create RelOptInfos for those. So, to
me it looks like we have to either have AppendRelInfos for partitioned
partitions or create RelOptInfos by traversing some other list like
PartitionedChildRelInfo list. It looks odd to walk
root->append_rel_list as well as this new list for creating
RelOptInfos. But for a moment, we assume that we have to walk this
other list. But then that other list is also lossy. It stores only the
topmost parent of any of the partitioned partitions and not the
immediate parent as required to add RelOptInfos of immediate children
to the RelOptInfo of a parent.

Coming back to the point of PartitionedChildRelInfo list as a way to
maintain parent - child relationship. All the code assumes that the
parent-child relationship is stored in AppendRelInfo linked as
root->append_rel_list and walks that list to find children of a given
parent of parent/s of a given child. We will have to modify all those
places to traverse two lists instead of one. Some of those even return
AppendRelInfo structure, and now they some times return an
AppendRelInfo and sometimes PartitionedChildRelInfo. That looks ugly.

Consider a case where P has partitions p1 and p2, which in turn have
partitions p11, p12 and p21, p22 resp. Another partitioned table Q has
partitions q1, q2. q1 is further partitioned into q11, q12 but q2 is
not partitioned. The partitioning scheme of P and Q matches. Also,
partitioning scheme of p1 and q1 matches. So, a partition-wise join
between P and Q would look like P J Q = append (p11 J q11, p12 J q12,
p2 J q2), p2 J q2 being append(p21, p22) J q2. When constructing the
restrictlist (and other clauses) for p2 J q2 we need to translate the
restrictlist applicable for P J Q. This translation requires
AppendRelInfo of p2 which does not exist today. We can not use
PartitionedChildRelInfo because it doesn't have information about how
to translate Vars of P to those of p2.

I don't see a way to avoid creating AppendRelInfos for partitioned partitions.

[1]: /messages/by-id/CAFjFpRefs5ZMnxQ2vP9v5zOtWtNPuiMYc01sb1SWjCOB1CT=uQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#101Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#100)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/03/30 18:35, Ashutosh Bapat wrote:

On Wed, Mar 29, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't think 0011 is likely to be acceptable in current form. I
can't imagine that we just went to the trouble of getting rid of
AppendRelInfos for child partitioned rels only to turn around and put
them back again. If you just need the parent-child mappings, you can
get that from the PartitionedChildRelInfo list.

Please refer to my earlier mails on this subject [1], [2]. For
multi-level partition-wise join, we need RelOptInfo of a partitioned
table to contain RelOptInfo of its immediate partitions. I have not
seen any counter arguments not to create RelOptInfos for intermediate
partitioned tables. We create child RelOptInfos only for entries in
root->append_rel_list i.e. only for those relations which have an
AppendRelInfo. Since we are not creating AppendRelInfos for
partitioned partitions, we do not create RelOptInfos for those. So, to
me it looks like we have to either have AppendRelInfos for partitioned
partitions or create RelOptInfos by traversing some other list like
PartitionedChildRelInfo list. It looks odd to walk
root->append_rel_list as well as this new list for creating
RelOptInfos. But for a moment, we assume that we have to walk this
other list. But then that other list is also lossy. It stores only the
topmost parent of any of the partitioned partitions and not the
immediate parent as required to add RelOptInfos of immediate children
to the RelOptInfo of a parent.

So, because we want to create an Append path for each partitioned table in
a tree separately, we'll need RelOptInfo for each one, which in turn
requires an AppendRelInfo. Note that we do that only for those
partitioned child RTEs that have inh set to true, so that all the later
stages will treat it as the parent rel to create an Append path for.
There would still be partitioned child RTEs with inh set to false for
which, just like before, no AppendRelInfos and RelOptInfos are created;
they get added as the only member of partitioned_rels in the
PartitionedChildRelInfo of each partitioned table. Finally, when the
Append path for the root parent is created, its subpaths list will contain
paths of leaf partitions of all levels and its partitioned_rels list
should contain the RT indexes of partitioned tables of all levels.

If we have the following partition tree:

A
/ | \
B C D
/ \
E F

The following RTEs will be created, in that order. RTEs with inh=true are
shown with suffix _i. RTEs that get an AppendRelInfo (& a RelOptInfo) are
shown with suffix _a.

A_i_a
A
B_a
C_i_a
C
E_a
F_a
D_a

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#102Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#101)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 30, 2017 at 6:32 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

So, because we want to create an Append path for each partitioned table in
a tree separately, we'll need RelOptInfo for each one, which in turn
requires an AppendRelInfo.

Hmm. It's no more desirable to have an Append inside of another
Append with partitionwise join than it is in general. If we've got A
partitioned into A1, A2, A3 and similarly B partitioned into B1, B2,
and B3, and then A1 and B1 are further partitioned into A1a, A1b, B1a,
B1b, then a partitionwise join between the tables ought to end up
looking like this:

Append
-> Join (A1a, B1a)
-> Join (A1b, B1b)
-> Join (A2, B2)
-> Join (A3, B3)

So here we don't actually end up with an append-path for A1-B1 here
anywhere. But you might need that in more complex cases, I guess,
because suppose you now join this to C with partitions C1, C2, C3; but
C1 is not sub-partitioned. Then you might end up with a plan like:

Append
-> Join
-> Append
-> Join (A1a, B1a)
-> Join (A1b, B1b)
-> Scan C1
-> Join ((A2, B2), C2)
-> Join ((A3, B3), C3)

So maybe you're right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#103Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#98)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Done.

Ashutosh and I spent several hours discussing this patch set today.
I'm starting to become concerned about the fact that 0004 makes the
partition bounds part of the PartitionScheme, because that means you
can't do a partition-wise join between two tables that have any
difference at all in the partition bounds. It might be possible in
the future to introduce a notion of a compatible partition scheme, so
that you could say, OK, well, these two partition schemes are not
quite the same, but they are compatible, and we'll make a new
partition scheme for whatever results from reconciling them.

What I think *may* be better is to consider the partition bound
information as a property of the RelOptInfo rather than the
PartitionScheme. For example, suppose we're joining A with partitions
A1, A2, and A4 against B with partitions B1, B2, and B3 and C with
partitions C1, C2, and C5. With the current approach, we end up with
a PartitionScheme for each baserel and, not in this patch set but
maybe eventually, a separate PartitionScheme for each of (A B), (A C),
(B C), and (A B C). That seems pretty unsatisfying. If we consider
the PartitionScheme to only include the question of whether we're
doing a join on the partition keys, then if the join includes WHERE
a.key = b.key AND b.key = c.key, we can say that they all have the
same PartitionScheme up front. Then, each RelOptInfo can have a
separate list of bounds, like this:

A: 1, 2, 4
B: 1, 2, 3
C: 1, 2, 5
A B: 1, 2, 3, 4
A C: 1, 2, 4, 5
B C: 1, 2, 3, 5
A B C: 1, 2, 3, 4, 5

Or if it's an inner join, then instead of taking the union at each
level, we can take the intersection, because any partition without a
match on the other side of the join, then that partition can't produce
any rows and doesn't need to be scanned. In that case, the
RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up
with a bound list of 1, 2.

A related question (that I did not discuss with Ashutosh, but occurred
to me later) is whether the PartitionScheme ought to worry about
cross-type joins. For instance, if A is partitioned on an int4 column
and B is partitioned on an int8 column, and they are joined on their
respective partitioning columns, can't we still do a partition-wise
join? We do need to require that the operator family of the operator
actually used in the query, the operator family used to partition the
inner table, and the operator family used to partition the other table
all match; and the collation used for the comparison in the query, the
collation used to partition the outer table, and the collation used to
partition the inner table must all match. But it doesn't seem
necessary to require an exact type or typmod match. In many ways this
seems a whole lot like the what we test when building equivalence
classes (cf. process_equivalence) although I'm not sure that we can
leverage that in any useful way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#104Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#103)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Apr 4, 2017 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Done.

Ashutosh and I spent several hours discussing this patch set today.
I'm starting to become concerned about the fact that 0004 makes the
partition bounds part of the PartitionScheme, because that means you
can't do a partition-wise join between two tables that have any
difference at all in the partition bounds. It might be possible in
the future to introduce a notion of a compatible partition scheme, so
that you could say, OK, well, these two partition schemes are not
quite the same, but they are compatible, and we'll make a new
partition scheme for whatever results from reconciling them.

What I think *may* be better is to consider the partition bound
information as a property of the RelOptInfo rather than the
PartitionScheme. For example, suppose we're joining A with partitions
A1, A2, and A4 against B with partitions B1, B2, and B3 and C with
partitions C1, C2, and C5. With the current approach, we end up with
a PartitionScheme for each baserel and, not in this patch set but
maybe eventually, a separate PartitionScheme for each of (A B), (A C),
(B C), and (A B C). That seems pretty unsatisfying. If we consider
the PartitionScheme to only include the question of whether we're
doing a join on the partition keys, then if the join includes WHERE
a.key = b.key AND b.key = c.key, we can say that they all have the
same PartitionScheme up front. Then, each RelOptInfo can have a
separate list of bounds, like this:

A: 1, 2, 4
B: 1, 2, 3
C: 1, 2, 5
A B: 1, 2, 3, 4
A C: 1, 2, 4, 5
B C: 1, 2, 3, 5
A B C: 1, 2, 3, 4, 5

Or if it's an inner join, then instead of taking the union at each
level, we can take the intersection, because any partition without a
match on the other side of the join, then that partition can't produce
any rows and doesn't need to be scanned. In that case, the
RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up
with a bound list of 1, 2.

I have separated partition bounds from partition scheme. The patch adds
build_joinrel_partition_bounds() to calculate the bounds of the join
relation and the pairs of matching partitions from the joining relation.
For now the function just check whether both the relations have same bounds
and returns the bounds of the first one. But in future, we will expand this
function to merge partition bounds from the joining relation and return
pairs of matching partitions which when joined form the partitions of the
join according to the merged partition bounds.

Also, moved the code to collect partition RelOptInfos from
set_append_rel_size() to build_simple_rel(), so everything related to
partitioning gets set in that function for a base relation.

I think, we should rename partition scheme as PartitionKeyOptInfo and club
partition bounds, nparts and part_rels as PartitionDescOptInfo. But I
haven't done that in this patch yet.

A related question (that I did not discuss with Ashutosh, but occurred
to me later) is whether the PartitionScheme ought to worry about
cross-type joins. For instance, if A is partitioned on an int4 column
and B is partitioned on an int8 column, and they are joined on their
respective partitioning columns, can't we still do a partition-wise
join? We do need to require that the operator family of the operator
actually used in the query, the operator family used to partition the
inner table, and the operator family used to partition the other table
all match; and the collation used for the comparison in the query, the
collation used to partition the outer table, and the collation used to
partition the inner table must all match. But it doesn't seem
necessary to require an exact type or typmod match. In many ways this
seems a whole lot like the what we test when building equivalence
classes (cf. process_equivalence) although I'm not sure that we can
leverage that in any useful way.

Yes, I agree. For an inner join, the partition key types need to "shrink"
and for outer join they need to be "widened". I don't know if there is a
way to know "wider" or "shorter" of two given types. We might have to
implement a method to merge partition keys to produce partition key of the
join, which may be different from either of the partition keys. So,
after-all we may have to abandon the idea of canonical partition scheme. I
haven't included this change in the attached set of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v17.zipapplication/zip; name=pg_dp_join_patches_v17.zip
#105Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#104)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Somehow I sent the old patch set again. Here's the real v17.

On Tue, Apr 4, 2017 at 7:52 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Apr 4, 2017 at 2:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 30, 2017 at 1:14 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Done.

Ashutosh and I spent several hours discussing this patch set today.
I'm starting to become concerned about the fact that 0004 makes the
partition bounds part of the PartitionScheme, because that means you
can't do a partition-wise join between two tables that have any
difference at all in the partition bounds. It might be possible in
the future to introduce a notion of a compatible partition scheme, so
that you could say, OK, well, these two partition schemes are not
quite the same, but they are compatible, and we'll make a new
partition scheme for whatever results from reconciling them.

What I think *may* be better is to consider the partition bound
information as a property of the RelOptInfo rather than the
PartitionScheme. For example, suppose we're joining A with partitions
A1, A2, and A4 against B with partitions B1, B2, and B3 and C with
partitions C1, C2, and C5. With the current approach, we end up with
a PartitionScheme for each baserel and, not in this patch set but
maybe eventually, a separate PartitionScheme for each of (A B), (A C),
(B C), and (A B C). That seems pretty unsatisfying. If we consider
the PartitionScheme to only include the question of whether we're
doing a join on the partition keys, then if the join includes WHERE
a.key = b.key AND b.key = c.key, we can say that they all have the
same PartitionScheme up front. Then, each RelOptInfo can have a
separate list of bounds, like this:

A: 1, 2, 4
B: 1, 2, 3
C: 1, 2, 5
A B: 1, 2, 3, 4
A C: 1, 2, 4, 5
B C: 1, 2, 3, 5
A B C: 1, 2, 3, 4, 5

Or if it's an inner join, then instead of taking the union at each
level, we can take the intersection, because any partition without a
match on the other side of the join, then that partition can't produce
any rows and doesn't need to be scanned. In that case, the
RelOptInfos for (A B), (A C), (B, C), and (A, B, C) will all end up
with a bound list of 1, 2.

I have separated partition bounds from partition scheme. The patch adds
build_joinrel_partition_bounds() to calculate the bounds of the join
relation and the pairs of matching partitions from the joining relation. For
now the function just check whether both the relations have same bounds and
returns the bounds of the first one. But in future, we will expand this
function to merge partition bounds from the joining relation and return
pairs of matching partitions which when joined form the partitions of the
join according to the merged partition bounds.

Also, moved the code to collect partition RelOptInfos from
set_append_rel_size() to build_simple_rel(), so everything related to
partitioning gets set in that function for a base relation.

I think, we should rename partition scheme as PartitionKeyOptInfo and club
partition bounds, nparts and part_rels as PartitionDescOptInfo. But I
haven't done that in this patch yet.

A related question (that I did not discuss with Ashutosh, but occurred
to me later) is whether the PartitionScheme ought to worry about
cross-type joins. For instance, if A is partitioned on an int4 column
and B is partitioned on an int8 column, and they are joined on their
respective partitioning columns, can't we still do a partition-wise
join? We do need to require that the operator family of the operator
actually used in the query, the operator family used to partition the
inner table, and the operator family used to partition the other table
all match; and the collation used for the comparison in the query, the
collation used to partition the outer table, and the collation used to
partition the inner table must all match. But it doesn't seem
necessary to require an exact type or typmod match. In many ways this
seems a whole lot like the what we test when building equivalence
classes (cf. process_equivalence) although I'm not sure that we can
leverage that in any useful way.

Yes, I agree. For an inner join, the partition key types need to "shrink"
and for outer join they need to be "widened". I don't know if there is a way
to know "wider" or "shorter" of two given types. We might have to implement
a method to merge partition keys to produce partition key of the join, which
may be different from either of the partition keys. So, after-all we may
have to abandon the idea of canonical partition scheme. I haven't included
this change in the attached set of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v17.zipapplication/zip; name=pg_dp_join_patches_v17.zip
#106Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#104)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Yes, I agree. For an inner join, the partition key types need to "shrink"
and for outer join they need to be "widened". I don't know if there is a way
to know "wider" or "shorter" of two given types. We might have to implement
a method to merge partition keys to produce partition key of the join, which
may be different from either of the partition keys. So, after-all we may
have to abandon the idea of canonical partition scheme. I haven't included
this change in the attached set of patches.

I think this is why you need to regard the partitioning scheme as
something more like an equivalence class - possibly the partitioning
scheme should actually contain (or be?) an equivalence class. Suppose
this is the query:

SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x;

...where i4 (x) is an int4 partitioning key and i8 (x) is an int8
partitioning key. It's meaningless to ask whether the result of the
join is partitioned by int4 or int8. It's partitioned by the
equivalence class that contains both i4.x and i8.x. If the result of
this join where joined to another table on either of those two
columns, a second partition-wise join would be theoretically possible.
If you insist on knowing the type of the partitioning scheme, rather
than just the opfamily, you've boxed yourself into a corner from which
there's no good escape.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#107Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#106)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Apr 5, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Yes, I agree. For an inner join, the partition key types need to "shrink"
and for outer join they need to be "widened". I don't know if there is a way
to know "wider" or "shorter" of two given types. We might have to implement
a method to merge partition keys to produce partition key of the join, which
may be different from either of the partition keys. So, after-all we may
have to abandon the idea of canonical partition scheme. I haven't included
this change in the attached set of patches.

I think this is why you need to regard the partitioning scheme as
something more like an equivalence class - possibly the partitioning
scheme should actually contain (or be?) an equivalence class. Suppose
this is the query:

SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x;

...where i4 (x) is an int4 partitioning key and i8 (x) is an int8
partitioning key. It's meaningless to ask whether the result of the
join is partitioned by int4 or int8. It's partitioned by the
equivalence class that contains both i4.x and i8.x. If the result of
this join where joined to another table on either of those two
columns, a second partition-wise join would be theoretically possible.
If you insist on knowing the type of the partitioning scheme, rather
than just the opfamily, you've boxed yourself into a corner from which
there's no good escape.

Only inner join conditions have equivalence classes associated with
those. Outer join conditions create single element equivalence
classes. So, we can not associate equivalence classes as they are with
partition scheme. If we could do that, it makes life much easier since
checking whether equi-join between all partition keys exist, is simply
looking up equivalence classes that cover joining relations and find
em_member corresponding to partition keys.

It looks like we should only keep strategy, partnatts, partopfamily
and parttypcoll in PartitionScheme. A partition-wise join between two
relations would be possible if all those match. When matching
partition bounds of joining relations, we should rely on partopfamily
to give us comparison function based on the types of partition keys
being joined. In that context it looks like all the partition bound
comparision functions which accept partition key were not written
keeping this use case in mind. They will need to be rewritten to
accept strategy, partnatts, partopfamily and parttypcoll.

There's a relevant comment in 0006, build_joinrel_partition_info()
(probably that name needs to change, but I will do that once we have
settled on design)
+   /*
+    * Construct partition keys for the join.
+    *
+    * An INNER join between two partitioned relations is partition by key
+    * expressions from both the relations. For tables A and B
partitioned by a and b
+    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
+    * and B.b.
+    *
+    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+    * B.b NULL. These rows may not fit the partitioning conditions imposed on
+    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
+    * Strictly speaking, partition keys of an OUTER join should include
+    * partition key expressions from the OUTER side only. Consider a join like
+    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
+    * include B.b as partition key expression for (AB), it prohibits us from
+    * using partition-wise join when joining (AB) with C as there is no
+    * equi-join between partition keys of joining relations. But two NULL
+    * values are never equal and no two rows from mis-matching partitions can
+    * join. Hence it's safe to include B.b as partition key expression for
+    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
+    */

I think that also needs to be reviewed carefully. Partition-wise joins
may be happy including partition keys from all sides, but
partition-wise aggregates may not be, esp. when pushing complete
aggregation down to partitions. In that case, rows with NULL partition
key, which falls on nullable side of join, will be spread across
multiple partitions. Proabably, we should separate nullable and
non-nullable partition key expressions.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#108Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#107)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Only inner join conditions have equivalence classes associated with
those. Outer join conditions create single element equivalence
classes. So, we can not associate equivalence classes as they are with
partition scheme. If we could do that, it makes life much easier since
checking whether equi-join between all partition keys exist, is simply
looking up equivalence classes that cover joining relations and find
em_member corresponding to partition keys.

OK.

It looks like we should only keep strategy, partnatts, partopfamily
and parttypcoll in PartitionScheme. A partition-wise join between two
relations would be possible if all those match.

Yes, I think so. Conceivably you could even exclude partnatts and
strategy, since there's nothing preventing a partitionwise join
between a list-partitioned table and a range-partitioned table, or
between a table range-partitioned on (a) and another range-partitioned
on (a, b), but there is probably not much benefit in trying to cover
such cases. I think it's reasonable to tell users that this is only
going to work when the partitioning strategy is the same and the join
conditions include all of the partitioning columns on both sides.

There's a relevant comment in 0006, build_joinrel_partition_info()
(probably that name needs to change, but I will do that once we have
settled on design)
+   /*
+    * Construct partition keys for the join.
+    *
+    * An INNER join between two partitioned relations is partition by key
+    * expressions from both the relations. For tables A and B
partitioned by a and b
+    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
+    * and B.b.
+    *
+    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+    * B.b NULL. These rows may not fit the partitioning conditions imposed on
+    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
+    * Strictly speaking, partition keys of an OUTER join should include
+    * partition key expressions from the OUTER side only. Consider a join like
+    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
+    * include B.b as partition key expression for (AB), it prohibits us from
+    * using partition-wise join when joining (AB) with C as there is no
+    * equi-join between partition keys of joining relations. But two NULL
+    * values are never equal and no two rows from mis-matching partitions can
+    * join. Hence it's safe to include B.b as partition key expression for
+    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
+    */

I think that also needs to be reviewed carefully.

The following passage from src/backend/optimizer/README seems highly relevant:

===
The planner's treatment of outer join reordering is based on the following
identities:

1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B). If Pbc is not
strict, the first form might produce some rows with nonnull C columns
where the second form would make those entries null.
===

In other words, I think your statement that null is never equal to
null is a bit imprecise. Somebody could certainly create an operator
that is named "=" which returns true in that case, and then they could
say, hey, two nulls are equal (when you use that operator). The
argument needs to be made in terms of the formal properties of the
operator. The relevant logic is in have_partkey_equi_join:

+               /* Skip clauses which are not equality conditions. */
+               if (rinfo->hashjoinoperator == InvalidOid &&
!rinfo->mergeopfamilies)
+                       continue;

Actually, I think the hashjoinoperator test is formally and
practically unnecessary here; lower down there is a test to see
whether the partitioning scheme's operator family is a member of
rinfo->mergeopfamilies, which will certainly fail if we got through
this test with rinfo->mergeopfamilies == NIL just on the strength of
rinfo->hashjoinoperator != InvalidOid. So you can just bail out if
rinfo->mergeopfamilies == NIL. But the underlying point here is that
the only thing you really know about the function is that it's got to
be a strategy-3 operator in some btree opclass; if that guarantees
strictness, then so be it -- but I wasn't able to find anything in the
code or documentation off-hand that supports that contention, so we
might need to think a bit more about why (or if) this is guaranteed to
be true.

Partition-wise joins
may be happy including partition keys from all sides, but
partition-wise aggregates may not be, esp. when pushing complete
aggregation down to partitions. In that case, rows with NULL partition
key, which falls on nullable side of join, will be spread across
multiple partitions. Proabably, we should separate nullable and
non-nullable partition key expressions.

I don't think I understand quite what you're getting at here. Can you
spell this out in more detail? To push an aggregate down to
partitions, you need the grouping key to match the applicable
partition key, and the partition key shouldn't allow nulls in more
than one place. Now I think your point may be that outer join
semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
good test case for partitionwise aggregate. I'd be inclined to think
that we should just give up on partitionwise aggregate in such cases;
it's not worth trying to optimize such a weird query, at least IMHO.
(Does this sort of case ever happen with joins? I think not, as long
as the join operator is strict.)

I spent some time thinking about this patch set today and I don't see
that there's much point in committing any more of this to v10. I
think that 0001 and 0002 are probably committable or very close at
this point. However, 0001 is adding more complexity than I think is
warranted until we're actually ready to commit the feature that uses
it, and 0002 is so small that committing isn't really going to smooth
future development much. 0003-0009 are essentially all one big patch
that will have to be committed together.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#109Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#108)
Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas <robertmhaas@gmail.com> writes:

... But the underlying point here is that
the only thing you really know about the function is that it's got to
be a strategy-3 operator in some btree opclass; if that guarantees
strictness, then so be it -- but I wasn't able to find anything in the
code or documentation off-hand that supports that contention, so we
might need to think a bit more about why (or if) this is guaranteed to
be true.

FWIW, I do not think that follows. If you want to check that the
function is strict, check that explicitly.

It's very likely that in practice, all such functions are indeed strict,
but we don't have an assumption about that wired into the planner.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#110Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#108)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Only inner join conditions have equivalence classes associated with
those. Outer join conditions create single element equivalence
classes. So, we can not associate equivalence classes as they are with
partition scheme. If we could do that, it makes life much easier since
checking whether equi-join between all partition keys exist, is simply
looking up equivalence classes that cover joining relations and find
em_member corresponding to partition keys.

OK.

It looks like we should only keep strategy, partnatts, partopfamily
and parttypcoll in PartitionScheme. A partition-wise join between two
relations would be possible if all those match.

Yes, I think so. Conceivably you could even exclude partnatts and
strategy, since there's nothing preventing a partitionwise join
between a list-partitioned table and a range-partitioned table, or
between a table range-partitioned on (a) and another range-partitioned
on (a, b), but there is probably not much benefit in trying to cover
such cases. I think it's reasonable to tell users that this is only
going to work when the partitioning strategy is the same and the join
conditions include all of the partitioning columns on both sides.

There's a relevant comment in 0006, build_joinrel_partition_info()
(probably that name needs to change, but I will do that once we have
settled on design)
+   /*
+    * Construct partition keys for the join.
+    *
+    * An INNER join between two partitioned relations is partition by key
+    * expressions from both the relations. For tables A and B
partitioned by a and b
+    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
+    * and B.b.
+    *
+    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+    * B.b NULL. These rows may not fit the partitioning conditions imposed on
+    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
+    * Strictly speaking, partition keys of an OUTER join should include
+    * partition key expressions from the OUTER side only. Consider a join like
+    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
+    * include B.b as partition key expression for (AB), it prohibits us from
+    * using partition-wise join when joining (AB) with C as there is no
+    * equi-join between partition keys of joining relations. But two NULL
+    * values are never equal and no two rows from mis-matching partitions can
+    * join. Hence it's safe to include B.b as partition key expression for
+    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
+    */

I think that also needs to be reviewed carefully.

The following passage from src/backend/optimizer/README seems highly relevant:

===
The planner's treatment of outer join reordering is based on the following
identities:

1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B). If Pbc is not
strict, the first form might produce some rows with nonnull C columns
where the second form would make those entries null.
===

In other words, I think your statement that null is never equal to
null is a bit imprecise. Somebody could certainly create an operator
that is named "=" which returns true in that case, and then they could
say, hey, two nulls are equal (when you use that operator). The
argument needs to be made in terms of the formal properties of the
operator. The relevant logic is in have_partkey_equi_join:

+               /* Skip clauses which are not equality conditions. */
+               if (rinfo->hashjoinoperator == InvalidOid &&
!rinfo->mergeopfamilies)
+                       continue;

Actually, I think the hashjoinoperator test is formally and
practically unnecessary here; lower down there is a test to see
whether the partitioning scheme's operator family is a member of
rinfo->mergeopfamilies, which will certainly fail if we got through
this test with rinfo->mergeopfamilies == NIL just on the strength of
rinfo->hashjoinoperator != InvalidOid. So you can just bail out if
rinfo->mergeopfamilies == NIL. But the underlying point here is that
the only thing you really know about the function is that it's got to
be a strategy-3 operator in some btree opclass; if that guarantees
strictness, then so be it -- but I wasn't able to find anything in the
code or documentation off-hand that supports that contention, so we
might need to think a bit more about why (or if) this is guaranteed to
be true.

I need more time to think about this. Will get back to this soon.

Partition-wise joins
may be happy including partition keys from all sides, but
partition-wise aggregates may not be, esp. when pushing complete
aggregation down to partitions. In that case, rows with NULL partition
key, which falls on nullable side of join, will be spread across
multiple partitions. Proabably, we should separate nullable and
non-nullable partition key expressions.

I don't think I understand quite what you're getting at here. Can you
spell this out in more detail? To push an aggregate down to
partitions, you need the grouping key to match the applicable
partition key, and the partition key shouldn't allow nulls in more
than one place. Now I think your point may be that outer join
semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
good test case for partitionwise aggregate. I'd be inclined to think
that we should just give up on partitionwise aggregate in such cases;
it's not worth trying to optimize such a weird query, at least IMHO.
(Does this sort of case ever happen with joins? I think not, as long
as the join operator is strict.)

Yes, this is the case, I am thinking about. No, it doesn't happen with join.

I spent some time thinking about this patch set today and I don't see
that there's much point in committing any more of this to v10. I
think that 0001 and 0002 are probably committable or very close at
this point. However, 0001 is adding more complexity than I think is
warranted until we're actually ready to commit the feature that uses
it, and 0002 is so small that committing isn't really going to smooth
future development much. 0003-0009 are essentially all one big patch
that will have to be committed together.

Ok. Thanks.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#111Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#106)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Apr 5, 2017 at 8:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Yes, I agree. For an inner join, the partition key types need to "shrink"
and for outer join they need to be "widened". I don't know if there is a way
to know "wider" or "shorter" of two given types. We might have to implement
a method to merge partition keys to produce partition key of the join, which
may be different from either of the partition keys. So, after-all we may
have to abandon the idea of canonical partition scheme. I haven't included
this change in the attached set of patches.

I think this is why you need to regard the partitioning scheme as
something more like an equivalence class - possibly the partitioning
scheme should actually contain (or be?) an equivalence class. Suppose
this is the query:

SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x;

...where i4 (x) is an int4 partitioning key and i8 (x) is an int8
partitioning key. It's meaningless to ask whether the result of the
join is partitioned by int4 or int8. It's partitioned by the
equivalence class that contains both i4.x and i8.x. If the result of
this join where joined to another table on either of those two
columns, a second partition-wise join would be theoretically possible.
If you insist on knowing the type of the partitioning scheme, rather
than just the opfamily, you've boxed yourself into a corner from which
there's no good escape.

When we merge partition bounds from two relations with different
partition key types, the merged partition bounds need to have some
information abound the way those constants look like e.g. their
length, structure etc. That's the reason we need to store partition
key types of merged partitioning scheme. Consider a three way join (i4
JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare
partition bounds of i4 and i8, we use operators for int4 and int8. The
join i4 JOIN i8 will get partition bounds by merging those of i4 and
i8. When we come to join with i2, we need to know which operators to
use for comparing the partition bounds of the join with those of i2.

So, if the partition key types of the joining relations differ (but
they have matching partitioning schemes per strategy, natts and
operator family) the partition bounds of the join are converted to the
wider type among the partition key types of the joining tree.
Actually, as I am explained earlier we could choose a wider outer type
for an OUTER join and shorter type for inner join. This type is used
as partition key type of the join. In the above case join between i4
and i8 have its partition bounds converted to i8 (or i4) and then when
it is joined with i2 the partition bounds of the join are converted to
i8 (or i2).

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#112Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#111)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Apr 18, 2017 at 6:55 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

When we merge partition bounds from two relations with different
partition key types, the merged partition bounds need to have some
information abound the way those constants look like e.g. their
length, structure etc. That's the reason we need to store partition
key types of merged partitioning scheme. Consider a three way join (i4
JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare
partition bounds of i4 and i8, we use operators for int4 and int8. The
join i4 JOIN i8 will get partition bounds by merging those of i4 and
i8. When we come to join with i2, we need to know which operators to
use for comparing the partition bounds of the join with those of i2.

So, if the partition key types of the joining relations differ (but
they have matching partitioning schemes per strategy, natts and
operator family) the partition bounds of the join are converted to the
wider type among the partition key types of the joining tree.
Actually, as I am explained earlier we could choose a wider outer type
for an OUTER join and shorter type for inner join. This type is used
as partition key type of the join. In the above case join between i4
and i8 have its partition bounds converted to i8 (or i4) and then when
it is joined with i2 the partition bounds of the join are converted to
i8 (or i2).

I don't understand why you think that partition-wise join needs any
new logic here; if this were a non-partitionwise join, we'd similarly
need to use the correct operator, but the existing code handles that
just fine. If the join is performed partition-wise, it should use the
same operators that would have been used by a non-partitionwise join
between the same tables.

I think the choice of operator depends only on the column types, and
that the "width" of those types has nothing to do with it. For
example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
operator for an A/B join or a B/C join will be the one that appears in
the query; parse analysis will have identified which specific operator
is meant based on the types of the columns. If the optimizer
subsequently decides to reorder the joins and perform the A/C join
first, it will go hunt down the operator with the same strategy number
in the same operator family that takes the type of A.x on one side and
the type of C.x on the other side. No problem. A partition-wise join
between A and C will use that same operator; again, no problem.

Your example involves joining the output of a join between i4 and i8
against i2, so it seems there is some ambiguity about what the input
type should be. But, again, the planner already copes with this
problem. In fact, the join is performed either using i4.x or i8.x --
I don't know what happens, or whether it depends on other details of
the query or the plan -- and the operator which can accept that value
on one side and i2.x on the other side is the one that gets used.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#113Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#112)
Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas <robertmhaas@gmail.com> writes:

I don't understand why you think that partition-wise join needs any
new logic here; if this were a non-partitionwise join, we'd similarly
need to use the correct operator, but the existing code handles that
just fine. If the join is performed partition-wise, it should use the
same operators that would have been used by a non-partitionwise join
between the same tables.

More to the point, the appropriate operator was chosen by parse analysis.
The planner has *zero* flexibility as to which operator is involved.

BTW, I remain totally mystified as to what people think the semantics of
partitioning ought to be. Child columns can have a different type from
parent columns? Really? Why is this even under discussion? We don't
allow that in old-school inheritance, and I cannot imagine a rational
argument why partitioning should allow it.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#114Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Tom Lane (#113)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Apr 20, 2017 at 11:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

BTW, I remain totally mystified as to what people think the semantics of
partitioning ought to be. Child columns can have a different type from
parent columns? Really? Why is this even under discussion? We don't
allow that in old-school inheritance, and I cannot imagine a rational
argument why partitioning should allow it.

No, we aren't doing that. We are discussing here how to represent
partition bounds of top level join and all the intermediate joins
between A, B and C which are partitioned tables with different
partition key types. We are not discussing the column types of
children, join or simple.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#115Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#112)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 18, 2017 at 6:55 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

When we merge partition bounds from two relations with different
partition key types, the merged partition bounds need to have some
information abound the way those constants look like e.g. their
length, structure etc. That's the reason we need to store partition
key types of merged partitioning scheme. Consider a three way join (i4
JOIN i8 ON i4.x = i8.x) JOIN i2 ON (i2.x = i.x). When we compare
partition bounds of i4 and i8, we use operators for int4 and int8. The
join i4 JOIN i8 will get partition bounds by merging those of i4 and
i8. When we come to join with i2, we need to know which operators to
use for comparing the partition bounds of the join with those of i2.

So, if the partition key types of the joining relations differ (but
they have matching partitioning schemes per strategy, natts and
operator family) the partition bounds of the join are converted to the
wider type among the partition key types of the joining tree.
Actually, as I am explained earlier we could choose a wider outer type
for an OUTER join and shorter type for inner join. This type is used
as partition key type of the join. In the above case join between i4
and i8 have its partition bounds converted to i8 (or i4) and then when
it is joined with i2 the partition bounds of the join are converted to
i8 (or i2).

I don't understand why you think that partition-wise join needs any
new logic here; if this were a non-partitionwise join, we'd similarly
need to use the correct operator, but the existing code handles that
just fine. If the join is performed partition-wise, it should use the
same operators that would have been used by a non-partitionwise join
between the same tables.

I think the choice of operator depends only on the column types, and
that the "width" of those types has nothing to do with it. For
example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
operator for an A/B join or a B/C join will be the one that appears in
the query; parse analysis will have identified which specific operator
is meant based on the types of the columns. If the optimizer
subsequently decides to reorder the joins and perform the A/C join
first, it will go hunt down the operator with the same strategy number
in the same operator family that takes the type of A.x on one side and
the type of C.x on the other side. No problem. A partition-wise join
between A and C will use that same operator; again, no problem.

Your example involves joining the output of a join between i4 and i8
against i2, so it seems there is some ambiguity about what the input
type should be. But, again, the planner already copes with this
problem. In fact, the join is performed either using i4.x or i8.x --
I don't know what happens, or whether it depends on other details of
the query or the plan -- and the operator which can accept that value
on one side and i2.x on the other side is the one that gets used.

I think you are confusing join condition application and partition
bounds of a join relation. What you have described above is how
operators are chosen to apply join conditions - it picks up the
correct operator from the operator family based on the column types
being used in join condition. That it can do because the columns being
joined are both present the relations being joined, irrespective of
which pair of relations is being joined. In your example, A.x, B.x and
C.x are all present on one of the sides of join irrespective of
whether the join is executed as (AB)C, A(BC) or (AC)B.

But the problem we are trying to solve here about partition bounds of
the join relation: what should be the partition bounds of AB, BC or
AC? When we compare partition bounds of and intermediate join with
other intermediate join (e.g. AB with those of C) what operator should
be used? You seem to be suggesting that we keep as many sets of
partition bounds as there are base relations participating in the join
and then use appropriate partition bounds based on the columns in the
join conditions, so that we can use the same operator as used in the
join condition. That doesn't seem to be a good option since the
partition bounds will all have same values, only differing in their
binary representation because of differences in data types. I am of
the opinion that we save a single set of partition bounds. We have to
then associate a data type with bounds to know binary representation
of partition bound datums. That datatype would be one of the partition
key types of joining relations. I may be wrong in using term "wider"
since its associated with the length of binary reprentation. But we
need some logic to coalesce the two data types based on the type of
join and key type on the outer side.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#116Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#115)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/04/20 15:45, Ashutosh Bapat wrote:

On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't understand why you think that partition-wise join needs any
new logic here; if this were a non-partitionwise join, we'd similarly
need to use the correct operator, but the existing code handles that
just fine. If the join is performed partition-wise, it should use the
same operators that would have been used by a non-partitionwise join
between the same tables.

I think the choice of operator depends only on the column types, and
that the "width" of those types has nothing to do with it. For
example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
operator for an A/B join or a B/C join will be the one that appears in
the query; parse analysis will have identified which specific operator
is meant based on the types of the columns. If the optimizer
subsequently decides to reorder the joins and perform the A/C join
first, it will go hunt down the operator with the same strategy number
in the same operator family that takes the type of A.x on one side and
the type of C.x on the other side. No problem. A partition-wise join
between A and C will use that same operator; again, no problem.

Your example involves joining the output of a join between i4 and i8
against i2, so it seems there is some ambiguity about what the input
type should be. But, again, the planner already copes with this
problem. In fact, the join is performed either using i4.x or i8.x --
I don't know what happens, or whether it depends on other details of
the query or the plan -- and the operator which can accept that value
on one side and i2.x on the other side is the one that gets used.

I think you are confusing join condition application and partition
bounds of a join relation. What you have described above is how
operators are chosen to apply join conditions - it picks up the
correct operator from the operator family based on the column types
being used in join condition. That it can do because the columns being
joined are both present the relations being joined, irrespective of
which pair of relations is being joined. In your example, A.x, B.x and
C.x are all present on one of the sides of join irrespective of
whether the join is executed as (AB)C, A(BC) or (AC)B.

But the problem we are trying to solve here about partition bounds of
the join relation: what should be the partition bounds of AB, BC or
AC? When we compare partition bounds of and intermediate join with
other intermediate join (e.g. AB with those of C) what operator should
be used? You seem to be suggesting that we keep as many sets of
partition bounds as there are base relations participating in the join
and then use appropriate partition bounds based on the columns in the
join conditions, so that we can use the same operator as used in the
join condition. That doesn't seem to be a good option since the
partition bounds will all have same values, only differing in their
binary representation because of differences in data types. I am of
the opinion that we save a single set of partition bounds. We have to
then associate a data type with bounds to know binary representation
of partition bound datums. That datatype would be one of the partition
key types of joining relations. I may be wrong in using term "wider"
since its associated with the length of binary reprentation. But we
need some logic to coalesce the two data types based on the type of
join and key type on the outer side.

FWIW, I think that using any one of the partition bounds of the baserels
being partitionwise-joined should suffice as the partition bound of any
combination of joins involving two or more of those baserels, as long as
the partitioning operator of each of the baserels is in the same operator
family (I guess that *is* checked somewhere in the partitionwise join
consideration flow). IOW, partopfamily[] of all of the baserels should
match and then the join clause operators involved should belong to the
same respective operator families.

ISTM, the question here is about how to derive the partitioning properties
of joinrels from those of the baserels involved. Even if the join
conditions refer to columns of different types on two sides, as long as
the partitioning and joining is known to occur using operators of
compatible semantics, I don't understand what more needs to be considered
or done. Although, I haven't studied things in enough detail to say
anything confidently about whether join being INNER or OUTER has any
bearing on the semantics of the partitioning of the joinrels in question.
IIUC, using partitioning properties to apply partitionwise join technique
at successive join levels will be affected by the OUTER considerations
similar to how they affect what levels a give EquivalenceClass clause
could be applied without causing any semantics violations. As already
mentioned upthread, it would be a good idea to have some integration of
the partitioning considerations with the equivalence class mechanism (how
ForeignKeyOptInfo contains links to ECs comes to mind).

By the way, looking at match_expr_to_partition_keys() in your latest
patch, I wonder why not use an approach similar to calling
is_indexable_operator() that is used in match_clause_to_indexcol()? Note
that is_indexable_operator() simply checks if clause->opno is in the index
key's operator family, as returned by op_in_opfamily(). Instead I see the
following:

/*
* The clause allows partition-wise join if only it uses the same
* operator family as that specified by the partition key.
*/
if (!list_member_oid(rinfo->mergeopfamilies,
part_scheme->partopfamily[ipk1]))
continue;

But maybe I am missing something.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#117Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#116)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Apr 20, 2017 at 3:35 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/04/20 15:45, Ashutosh Bapat wrote:

On Thu, Apr 20, 2017 at 10:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't understand why you think that partition-wise join needs any
new logic here; if this were a non-partitionwise join, we'd similarly
need to use the correct operator, but the existing code handles that
just fine. If the join is performed partition-wise, it should use the
same operators that would have been used by a non-partitionwise join
between the same tables.

I think the choice of operator depends only on the column types, and
that the "width" of those types has nothing to do with it. For
example, if the user writes .WHERE A.x = B.x AND B.x = C.x, the
operator for an A/B join or a B/C join will be the one that appears in
the query; parse analysis will have identified which specific operator
is meant based on the types of the columns. If the optimizer
subsequently decides to reorder the joins and perform the A/C join
first, it will go hunt down the operator with the same strategy number
in the same operator family that takes the type of A.x on one side and
the type of C.x on the other side. No problem. A partition-wise join
between A and C will use that same operator; again, no problem.

Your example involves joining the output of a join between i4 and i8
against i2, so it seems there is some ambiguity about what the input
type should be. But, again, the planner already copes with this
problem. In fact, the join is performed either using i4.x or i8.x --
I don't know what happens, or whether it depends on other details of
the query or the plan -- and the operator which can accept that value
on one side and i2.x on the other side is the one that gets used.

I think you are confusing join condition application and partition
bounds of a join relation. What you have described above is how
operators are chosen to apply join conditions - it picks up the
correct operator from the operator family based on the column types
being used in join condition. That it can do because the columns being
joined are both present the relations being joined, irrespective of
which pair of relations is being joined. In your example, A.x, B.x and
C.x are all present on one of the sides of join irrespective of
whether the join is executed as (AB)C, A(BC) or (AC)B.

But the problem we are trying to solve here about partition bounds of
the join relation: what should be the partition bounds of AB, BC or
AC? When we compare partition bounds of and intermediate join with
other intermediate join (e.g. AB with those of C) what operator should
be used? You seem to be suggesting that we keep as many sets of
partition bounds as there are base relations participating in the join
and then use appropriate partition bounds based on the columns in the
join conditions, so that we can use the same operator as used in the
join condition. That doesn't seem to be a good option since the
partition bounds will all have same values, only differing in their
binary representation because of differences in data types. I am of
the opinion that we save a single set of partition bounds. We have to
then associate a data type with bounds to know binary representation
of partition bound datums. That datatype would be one of the partition
key types of joining relations. I may be wrong in using term "wider"
since its associated with the length of binary reprentation. But we
need some logic to coalesce the two data types based on the type of
join and key type on the outer side.

FWIW, I think that using any one of the partition bounds of the baserels
being partitionwise-joined should suffice as the partition bound of any
combination of joins involving two or more of those baserels, as long as
the partitioning operator of each of the baserels is in the same operator
family (I guess that *is* checked somewhere in the partitionwise join
consideration flow). IOW, partopfamily[] of all of the baserels should
match and then the join clause operators involved should belong to the
same respective operator families.

The partition bounds of different base rels may be different and we
have to compare them. Even we say that we join two tables with same
partition bounds using partitio-wise join, we need to make sure that
those partition bounds are indeed same, thus requiring to compare. And
to compare any datum we need to know its type.

ISTM, the question here is about how to derive the partitioning properties
of joinrels from those of the baserels involved. Even if the join
conditions refer to columns of different types on two sides, as long as
the partitioning and joining is known to occur using operators of
compatible semantics, I don't understand what more needs to be considered
or done. Although, I haven't studied things in enough detail to say
anything confidently about whether join being INNER or OUTER has any
bearing on the semantics of the partitioning of the joinrels in question.
IIUC, using partitioning properties to apply partitionwise join technique
at successive join levels will be affected by the OUTER considerations
similar to how they affect what levels a give EquivalenceClass clause
could be applied without causing any semantics violations. As already
mentioned upthread, it would be a good idea to have some integration of
the partitioning considerations with the equivalence class mechanism (how
ForeignKeyOptInfo contains links to ECs comes to mind).

This has been already discussed. I have showed earlier why equivalence
classes are not useful in this case.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#118Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#115)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Apr 20, 2017 at 8:45 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I think you are confusing join condition application and partition
bounds of a join relation.

You're right, I misunderstood what you were talking about.

But the problem we are trying to solve here about partition bounds of
the join relation: what should be the partition bounds of AB, BC or
AC? When we compare partition bounds of and intermediate join with
other intermediate join (e.g. AB with those of C) what operator should
be used? You seem to be suggesting that we keep as many sets of
partition bounds as there are base relations participating in the join
and then use appropriate partition bounds based on the columns in the
join conditions, so that we can use the same operator as used in the
join condition. That doesn't seem to be a good option since the
partition bounds will all have same values, only differing in their
binary representation because of differences in data types.

Well, actually, I think it is a good option, as I wrote in
/messages/by-id/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com

In that email, my principal concern was allowing partition-wise join
to succeed even with slightly different sets of partition boundaries
on the two sides of the join; in particular, if we've got A with A1 ..
A10 and B with B1 .. B10 and the DBA adds A11, I don't want
performance to tank until the DBA gets around to adding B11. Removing
the partition bounds from the PartitionScheme and storing them
per-RelOptInfo fixes that problem; the fact that it also solves this
problem of what happens when we have different data types on the two
sides looks to me like a second reason to go that way.

And there's a third reason, too, which is that the opfamily mechanism
doesn't currently provide any mechanism for reasoning about which data
types are "wider" or "narrower" in the way that you want. In general,
there's not even a reason why such a relationship has to exist;
consider two data types t1 and t2 with opclasses t1_ops and t2_ops
that are part of the same opfamily t_ops, and suppose that t1 can
represent any positive integer and t2 can represent any even integer,
or in general that each data type can represent some but not all of
the values that can be represented by the other data type. In such a
case, neither would be "wider" than the other in the sense that you
need; you essentially want to find a data type within the opfamily to
which all values of any of the types involved in the query can be cast
without error, but there is nothing today which requires such a data
type to exist, and no way to identify which one it is. In practice,
for all of the built-in opfamilies that have more than one opclass,
such a data type always exists but is not always unique -- in
particular, datetime_ops contains date_ops, timestamptz_ops, and
timestamp_ops, and either of the latter two is a plausible choice for
the "widest" data type of the three. But there's no way to figure
that out from the opfamily or opclass information we have today.

In theory, it would be possible to modify the opfamily machinery so
that every opfamily designates an optional ordering of types from
"narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
guarantee that every value of type t1 can be cast without error to a
value of type t2. But I think that's a bad plan. It means that every
opfamily created by either the core code or some extension now needs
to worry about annotating the opclass with this new information, and
we have to add to core the SQL syntax and supporting code to make that
work. If it were implementing a valuable feature which could not
practically be implemented without extending the opfamily machinery,
then I guess that's what we'd have to suck it up and incur that
complexity, but in this case it does not appear necessary. Storing
the partition bounds per-RelOptInfo makes this problem -- and a few
others -- go away.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#119Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#118)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

You seem to be suggesting that we keep as many sets of
partition bounds as there are base relations participating in the join
and then use appropriate partition bounds based on the columns in the
join conditions, so that we can use the same operator as used in the
join condition. That doesn't seem to be a good option since the
partition bounds will all have same values, only differing in their
binary representation because of differences in data types.

Well, actually, I think it is a good option, as I wrote in
/messages/by-id/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com

I guess, you are now confusing between partition bounds for a join
relation and partition bounds of base relation. Above paragraph is
about partition bounds of a join relation. I have already agreed that
we need to store partition bounds in RelOptInfo. For base relation
this is trivial; its RelOptInfo has to store partition bounds as
stored in the partition descriptor of corresponding partitioned table.
I am talking about partition bounds of a join relation. See below for
more explanation.

In that email, my principal concern was allowing partition-wise join
to succeed even with slightly different sets of partition boundaries
on the two sides of the join; in particular, if we've got A with A1 ..
A10 and B with B1 .. B10 and the DBA adds A11, I don't want
performance to tank until the DBA gets around to adding B11. Removing
the partition bounds from the PartitionScheme and storing them
per-RelOptInfo fixes that problem;

We have an agreement on this.

the fact that it also solves this
problem of what happens when we have different data types on the two
sides looks to me like a second reason to go that way.

I don't see how is that fixed. For a join relation we need to come up
with one set of partition bounds by merging partition bounds of the
joining relation and in order to understand how to interpret the
datums in the partition bounds, we need to associate data types. The
question is which data type we should use if the relations being
joined have different data types associated with their respective
partition bounds.

Or are you saying that we don't need to associate data type with
merged partition bounds? In that case, I don't know how do we compare
the partition bounds of two relations?

In your example, A has partition key of type int8, has bound datums
X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
X11. C has partition key type int2 and bound datums X1 .. X12. The
binary representation of X's is going to differ between A, B and C
although each Xk for A, B and C is equal, wherever exists. Join
between A and B will have merged bound datums X1 .. X10 (and X11
depending upon the join type). In order to match bounds of AB with C,
we need to know the data type of bounds of AB, so that we can choose
appropriate equality operator. The question is what should we choose
as data type of partition bounds of AB, int8 or int4. This is
different from applying join conditions between AB and C, which can
choose the right opfamily operator based on the join conditions.

And there's a third reason, too, which is that the opfamily mechanism
doesn't currently provide any mechanism for reasoning about which data
types are "wider" or "narrower" in the way that you want. In general,
there's not even a reason why such a relationship has to exist;
consider two data types t1 and t2 with opclasses t1_ops and t2_ops
that are part of the same opfamily t_ops, and suppose that t1 can
represent any positive integer and t2 can represent any even integer,
or in general that each data type can represent some but not all of
the values that can be represented by the other data type. In such a
case, neither would be "wider" than the other in the sense that you
need; you essentially want to find a data type within the opfamily to
which all values of any of the types involved in the query can be cast
without error, but there is nothing today which requires such a data
type to exist, and no way to identify which one it is. In practice,
for all of the built-in opfamilies that have more than one opclass,
such a data type always exists but is not always unique -- in
particular, datetime_ops contains date_ops, timestamptz_ops, and
timestamp_ops, and either of the latter two is a plausible choice for
the "widest" data type of the three. But there's no way to figure
that out from the opfamily or opclass information we have today.

In theory, it would be possible to modify the opfamily machinery so
that every opfamily designates an optional ordering of types from
"narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
guarantee that every value of type t1 can be cast without error to a
value of type t2. But I think that's a bad plan. It means that every
opfamily created by either the core code or some extension now needs
to worry about annotating the opclass with this new information, and
we have to add to core the SQL syntax and supporting code to make that
work. If it were implementing a valuable feature which could not
practically be implemented without extending the opfamily machinery,
then I guess that's what we'd have to suck it up and incur that
complexity, but in this case it does not appear necessary. Storing
the partition bounds per-RelOptInfo makes this problem -- and a few
others -- go away.

This seems to suggest that we can not come up with merged bounds for
join if the partition key types of joining relations differ.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#120Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#119)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Here's an updated patch set

0001-Refactor-adjust_appendrel_attrs-adjust_appendrel_att.patch
0002-Refactor-calc_nestloop_required_outer-and-allow_star.patch
These are same as earlier patch set.

0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch
0004-Modify-bound-comparision-functions-to-accept-members.patch
These are new patches to refactor partition bound comparison functions
without passing partition key directly. Changes in the first patch are
being used in this set but the second patch will be useful for more
generic bound matching.

0005-Multi-level-partitioned-table-expansion.patch
This is same as old set with minor changes. I have moved it ahead of
the other patches as we discussed offline.

0006-Canonical-partition-scheme.patch
Partition bounds are no more part of partition scheme. They appear in
RelOptInfo. We are discussing the data type handling of partition
bounds for join relation. So, I still have partopcintype in partition
scheme. There is one change though. From
ComputePartitionAttrs()->GetDefaultOpClass()/ResolveOpClass(), I
gather that partopcintype is the type used for comparison of partition
bounds instead of parttypid. When they are different they are binary
compatible. So, I have saved partopcintype in PartitionScheme instead
of parttypid and parttypmod.

0007-Canonical-partitioning-scheme-for-multi-level-partit.patch
What was earlier mult-level partition-wise join support patch is now
broken into set of patches and goes with corresponding patch for
single-level partition-wise join patch. The idea is if we agree on
changes for multi-level partitioning support and want to commit it
before partition-wise join support, we can squash those pairs into
one. This also associates the multi-level support changes with
corresponding changes for single-level support. That might be easier
to review.

0008-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch
No changes in this patch.

0009-Partition-wise-join-implementation.patch
The patch adds build_joinrel_partition_bounds() to match the partition
bounds of the relations being joined. This function is called from
try_partition_wise_join(). This function is also responsible for
creating the pairs of matching partitions. When we come to support
partition-wise joins for unequal number of partitions, this function
would change without changing rest of the code.

0010-Multi-level-partition-wise-join-implementation.patch
multi-level support

0011-Adjust-join-related-to-code-to-accept-child-relation.patch
No changes to this patch.

0012-Fix-ConvertRowtypeExpr-refs-in-join-targetlist-and-q.patch
Fixes a crash with mult-level partitioning reported by Rajkumar. Fixes
set_plan_refs code for nested ConvertRowtypeExprs corresponding to
multiple levels of partitions.

0013-Parameterized-path-fixes.patch
No changes to this patch.

0014-Reparameterize-path-across-multiple-levels-of-partit.patch
Multi-level support changes for 0013

0015-Partition-wise-join-tests.patch
0016-Multi-level-partition-wise-join-tests.patch
Added the testcases reported by Rajkumar.

On Fri, Apr 21, 2017 at 12:11 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

You seem to be suggesting that we keep as many sets of
partition bounds as there are base relations participating in the join
and then use appropriate partition bounds based on the columns in the
join conditions, so that we can use the same operator as used in the
join condition. That doesn't seem to be a good option since the
partition bounds will all have same values, only differing in their
binary representation because of differences in data types.

Well, actually, I think it is a good option, as I wrote in
/messages/by-id/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com

I guess, you are now confusing between partition bounds for a join
relation and partition bounds of base relation. Above paragraph is
about partition bounds of a join relation. I have already agreed that
we need to store partition bounds in RelOptInfo. For base relation
this is trivial; its RelOptInfo has to store partition bounds as
stored in the partition descriptor of corresponding partitioned table.
I am talking about partition bounds of a join relation. See below for
more explanation.

In that email, my principal concern was allowing partition-wise join
to succeed even with slightly different sets of partition boundaries
on the two sides of the join; in particular, if we've got A with A1 ..
A10 and B with B1 .. B10 and the DBA adds A11, I don't want
performance to tank until the DBA gets around to adding B11. Removing
the partition bounds from the PartitionScheme and storing them
per-RelOptInfo fixes that problem;

We have an agreement on this.

the fact that it also solves this
problem of what happens when we have different data types on the two
sides looks to me like a second reason to go that way.

I don't see how is that fixed. For a join relation we need to come up
with one set of partition bounds by merging partition bounds of the
joining relation and in order to understand how to interpret the
datums in the partition bounds, we need to associate data types. The
question is which data type we should use if the relations being
joined have different data types associated with their respective
partition bounds.

Or are you saying that we don't need to associate data type with
merged partition bounds? In that case, I don't know how do we compare
the partition bounds of two relations?

In your example, A has partition key of type int8, has bound datums
X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
X11. C has partition key type int2 and bound datums X1 .. X12. The
binary representation of X's is going to differ between A, B and C
although each Xk for A, B and C is equal, wherever exists. Join
between A and B will have merged bound datums X1 .. X10 (and X11
depending upon the join type). In order to match bounds of AB with C,
we need to know the data type of bounds of AB, so that we can choose
appropriate equality operator. The question is what should we choose
as data type of partition bounds of AB, int8 or int4. This is
different from applying join conditions between AB and C, which can
choose the right opfamily operator based on the join conditions.

And there's a third reason, too, which is that the opfamily mechanism
doesn't currently provide any mechanism for reasoning about which data
types are "wider" or "narrower" in the way that you want. In general,
there's not even a reason why such a relationship has to exist;
consider two data types t1 and t2 with opclasses t1_ops and t2_ops
that are part of the same opfamily t_ops, and suppose that t1 can
represent any positive integer and t2 can represent any even integer,
or in general that each data type can represent some but not all of
the values that can be represented by the other data type. In such a
case, neither would be "wider" than the other in the sense that you
need; you essentially want to find a data type within the opfamily to
which all values of any of the types involved in the query can be cast
without error, but there is nothing today which requires such a data
type to exist, and no way to identify which one it is. In practice,
for all of the built-in opfamilies that have more than one opclass,
such a data type always exists but is not always unique -- in
particular, datetime_ops contains date_ops, timestamptz_ops, and
timestamp_ops, and either of the latter two is a plausible choice for
the "widest" data type of the three. But there's no way to figure
that out from the opfamily or opclass information we have today.

In theory, it would be possible to modify the opfamily machinery so
that every opfamily designates an optional ordering of types from
"narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
guarantee that every value of type t1 can be cast without error to a
value of type t2. But I think that's a bad plan. It means that every
opfamily created by either the core code or some extension now needs
to worry about annotating the opclass with this new information, and
we have to add to core the SQL syntax and supporting code to make that
work. If it were implementing a valuable feature which could not
practically be implemented without extending the opfamily machinery,
then I guess that's what we'd have to suck it up and incur that
complexity, but in this case it does not appear necessary. Storing
the partition bounds per-RelOptInfo makes this problem -- and a few
others -- go away.

This seems to suggest that we can not come up with merged bounds for
join if the partition key types of joining relations differ.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v18.zipapplication/zip; name=pg_dp_join_patches_v18.zip
#121Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#119)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Apr 21, 2017 at 8:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I don't see how is that fixed. For a join relation we need to come up
with one set of partition bounds by merging partition bounds of the
joining relation and in order to understand how to interpret the
datums in the partition bounds, we need to associate data types. The
question is which data type we should use if the relations being
joined have different data types associated with their respective
partition bounds.

Or are you saying that we don't need to associate data type with
merged partition bounds? In that case, I don't know how do we compare
the partition bounds of two relations?

Well, since there is no guarantee that a datatype exists which can be
used to "merge" the partition bounds in the sense that you are
describing, and even if there is one we have no opfamily
infrastructure to find out which one it is, I think it would be smart
to try to set things up so that we don't need to do that. I believe
that's probably possible.

In your example, A has partition key of type int8, has bound datums
X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
X11. C has partition key type int2 and bound datums X1 .. X12.

OK, sure.

The binary representation of X's is going to differ between A, B and C
although each Xk for A, B and C is equal, wherever exists.

Agreed.

Join
between A and B will have merged bound datums X1 .. X10 (and X11
depending upon the join type). In order to match bounds of AB with C,
we need to know the data type of bounds of AB, so that we can choose
appropriate equality operator. The question is what should we choose
as data type of partition bounds of AB, int8 or int4. This is
different from applying join conditions between AB and C, which can
choose the right opfamily operator based on the join conditions.

Well, the join is actually being performed either on A.keycol =
C.keycol or on B.keycol = C.keycol, right? It has to be one or the
other; there's no "merged" join column in any relation's targetlist,
but only columns derived from the various baserels. So let's use that
set of bounds for the matching. It makes sense to use the set of
bounds for the matching that corresponds to the column actually being
joined, I think.

It's late here and I'm tired, but it seems like it should be possible
to relate the child joinrels of the AB join back to the child joinrels
of either A or B. (AB)1 .. (AB)10 related back to A1 .. A10 and B1 ..
B10. (AB)11 relates back to B11 but, of course not to A11, which
doesn't exist. If the join is INNER, (AB)11 is a dummy rel anyway and
actually we should probably see whether we can omit it altogether. If
the join is an outer join of some kind, there's an interesting case
where the user wrote A LEFT JOIN B or B RIGHT JOIN A so that A is not
on the nullable side of the join; in that case, too, (AB)11 is dummy
or nonexistent. Otherwise, assuming A is nullable, (AB)11 maps only
to B11 and not to A11. But that's absolutely right: if the join to C
uses A.keycol, either the join operator is strict and (AB)11 won't
match anything anyway, or it's not and partition-wise join is illegal
because A.keycol in (AB)11 can include not only values from X11 but
also nulls.

So, it seems to me that what you can do is loop over the childrels on
the outer side of the join. For each one, you've got a join clause
that relates the outer rel to the inner rel, and that join clause
mentions some baserel which is contained in the joinrel. So drill
down through the childrel to the corresponding partition of the
baserel and get those bounds. Then if you do the same thing for the
inner childrels, you've now got two lists of bounds, and the type on
the left matches the outer side of the join and the type on the right
matches the inner side of the join and the opfamily of the operator in
the join clause gives you a comparison operator that relates those two
types, and now you can match them up.

(We should also keep in mind the case where there are multiple columns
in the partition key.)

This seems to suggest that we can not come up with merged bounds for
join if the partition key types of joining relations differ.

Yes, I think that would be difficult.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#122Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#121)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Apr 22, 2017 at 3:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 21, 2017 at 8:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I don't see how is that fixed. For a join relation we need to come up
with one set of partition bounds by merging partition bounds of the
joining relation and in order to understand how to interpret the
datums in the partition bounds, we need to associate data types. The
question is which data type we should use if the relations being
joined have different data types associated with their respective
partition bounds.

Or are you saying that we don't need to associate data type with
merged partition bounds? In that case, I don't know how do we compare
the partition bounds of two relations?

Well, since there is no guarantee that a datatype exists which can be
used to "merge" the partition bounds in the sense that you are
describing, and even if there is one we have no opfamily
infrastructure to find out which one it is, I think it would be smart
to try to set things up so that we don't need to do that. I believe
that's probably possible.

In your example, A has partition key of type int8, has bound datums
X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
X11. C has partition key type int2 and bound datums X1 .. X12.

OK, sure.

The binary representation of X's is going to differ between A, B and C
although each Xk for A, B and C is equal, wherever exists.

Agreed.

Join
between A and B will have merged bound datums X1 .. X10 (and X11
depending upon the join type). In order to match bounds of AB with C,
we need to know the data type of bounds of AB, so that we can choose
appropriate equality operator. The question is what should we choose
as data type of partition bounds of AB, int8 or int4. This is
different from applying join conditions between AB and C, which can
choose the right opfamily operator based on the join conditions.

Well, the join is actually being performed either on A.keycol =
C.keycol or on B.keycol = C.keycol, right? It has to be one or the
other; there's no "merged" join column in any relation's targetlist,
but only columns derived from the various baserels. So let's use that
set of bounds for the matching. It makes sense to use the set of
bounds for the matching that corresponds to the column actually being
joined, I think.

It's late here and I'm tired, but it seems like it should be possible
to relate the child joinrels of the AB join back to the child joinrels
of either A or B. (AB)1 .. (AB)10 related back to A1 .. A10 and B1 ..
B10. (AB)11 relates back to B11 but, of course not to A11, which
doesn't exist. If the join is INNER, (AB)11 is a dummy rel anyway and
actually we should probably see whether we can omit it altogether. If
the join is an outer join of some kind, there's an interesting case
where the user wrote A LEFT JOIN B or B RIGHT JOIN A so that A is not
on the nullable side of the join; in that case, too, (AB)11 is dummy
or nonexistent. Otherwise, assuming A is nullable, (AB)11 maps only
to B11 and not to A11. But that's absolutely right: if the join to C
uses A.keycol, either the join operator is strict and (AB)11 won't
match anything anyway, or it's not and partition-wise join is illegal
because A.keycol in (AB)11 can include not only values from X11 but
also nulls.

So, it seems to me that what you can do is loop over the childrels on
the outer side of the join. For each one, you've got a join clause
that relates the outer rel to the inner rel, and that join clause
mentions some baserel which is contained in the joinrel. So drill
down through the childrel to the corresponding partition of the
baserel and get those bounds. Then if you do the same thing for the
inner childrels, you've now got two lists of bounds, and the type on
the left matches the outer side of the join and the type on the right
matches the inner side of the join and the opfamily of the operator in
the join clause gives you a comparison operator that relates those two
types, and now you can match them up.

This assumes that datums in partition bounds have one to one mapping
with the partitions, which isn't true for list partitions. For list
partitions we have multiple datums corresponding to the items listed
associated with a given partition. So, simply looping over the
partitions of outer relations doesn't work; in fact there are two
outer relations for a full outer join, so we have to loop over both of
them together in a merge join fashion.

Consider A join B where A has partitions A1 (a, b, c), A2(e, f), A3(g,
h) and B has partitions B1 (a, b), B2 (c, d, e), B3(f, g, h). If we
just look at the partitions, we won't recognize that list item c is
repeated in A1B1 and A2B2. That can be recognized only when we loop
over the datums of A and B trying to match the partitions. We will see
that for a, b A1 and B1 match but for c A1 and B1 do not match,
instead A1 and B2 match. In one to one partition matching we will bail
out here.

I think, we have to find the base relations whose partition bounds
should be used for comparison looking at the equi-join conditions and
then compare those partition bounds to come up with the partition
bounds of join relation. That won't work straight forward either when
their are partitions missing on either sides of the join, I guess.
Needs a careful thought.

(We should also keep in mind the case where there are multiple columns
in the partition key.)

Yes. This is tricky. Consider A partitioned by (a1, a2) B partitioned
by (b1, b2) and C partitioned by (c1, c2). If the query is A join B on
(A.a1 = B.a1 and A.a2 = B.b2) join C on (C.c1 = A.a1 and C.c2 = B.b2),
we need to fetch partition bound values for a1 from A's partition
bounds and those for b1 from B's partition bounds. Create combined
partition bounds from those and then compare the combined bounds with
those of C.

After saying all that, I think we have a precedence of merged join
columns with merged data types. Consider
create table t1(a int2, b int);
create table t2 (a int4, b int);
explain verbose select * from t1 join t2 using(a);
QUERY PLAN
-------------------------------------------------------------------------
Merge Join (cost=327.25..745.35 rows=27120 width=12)
Output: t2.a, t1.b, t2.b
Merge Cond: (t2.a = t1.a)
-> Sort (cost=158.51..164.16 rows=2260 width=8)
Output: t2.a, t2.b
Sort Key: t2.a
-> Seq Scan on public.t2 (cost=0.00..32.60 rows=2260 width=8)
Output: t2.a, t2.b
-> Sort (cost=168.75..174.75 rows=2400 width=6)
Output: t1.b, t1.a
Sort Key: t1.a
-> Seq Scan on public.t1 (cost=0.00..34.00 rows=2400 width=6)
Output: t1.b, t1.a
(13 rows)

When using clause is used the columns specified by using clause from
the joining relations are merged into a single column. Here it has
used a "wider" type column t2.a as the merged column for t1.a and
t2.a. The logic is in buildMergedJoinVar().

Probably we want to build merged partition bounds for a join relation
where partition keys of the joining relations are different using a
single data type provided by the same logic as buildMergedJoinVar()
and attach those to the join relation.

[1]: http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312629.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#123Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#122)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Apr 21, 2017 at 7:59 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

Here's an updated patch set

Hi,

I have applied v18 patches and got a crash in m-way joins when partition
ranges differ, below are steps to reproduce this.

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
INSERT INTO prt1 SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0,
599, 2) i;
ANALYZE prt1;

CREATE TABLE prt4_n (a int, b int, c text) PARTITION BY RANGE(a);
CREATE TABLE prt4_n_p1 PARTITION OF prt4_n FOR VALUES FROM (0) TO (300);
CREATE TABLE prt4_n_p2 PARTITION OF prt4_n FOR VALUES FROM (300) TO (500);
CREATE TABLE prt4_n_p3 PARTITION OF prt4_n FOR VALUES FROM (500) TO (600);
INSERT INTO prt4_n SELECT i, i, to_char(i, 'FM0000') FROM
generate_series(0, 599, 2) i;
ANALYZE prt4_n;

SET enable_partition_wise_join = on ;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt4_n t2, prt1 t3 WHERE t1.a =
t2.a AND t2.a = t3.a;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

#124Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#123)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Apr 24, 2017 at 5:02 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Fri, Apr 21, 2017 at 7:59 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's an updated patch set

Hi,

I have applied v18 patches and got a crash in m-way joins when partition
ranges differ, below are steps to reproduce this.

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
INSERT INTO prt1 SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0,
599, 2) i;
ANALYZE prt1;

CREATE TABLE prt4_n (a int, b int, c text) PARTITION BY RANGE(a);
CREATE TABLE prt4_n_p1 PARTITION OF prt4_n FOR VALUES FROM (0) TO (300);
CREATE TABLE prt4_n_p2 PARTITION OF prt4_n FOR VALUES FROM (300) TO (500);
CREATE TABLE prt4_n_p3 PARTITION OF prt4_n FOR VALUES FROM (500) TO (600);
INSERT INTO prt4_n SELECT i, i, to_char(i, 'FM0000') FROM generate_series(0,
599, 2) i;
ANALYZE prt4_n;

SET enable_partition_wise_join = on ;
EXPLAIN (COSTS OFF)
SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt4_n t2, prt1 t3 WHERE t1.a =
t2.a AND t2.a = t3.a;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks Rajkumar for the report. When two relations with same partition
scheme but different partition bounds are joined, their join relation
has partition scheme set, but not partition bounds since we do not
have logic to merge such partition bounds. When this join relation is
joined further with other relation with same partition scheme, the
code assumed that the join relation had partition bounds set. So the
corresponding assertion failed. Instead, we should treat this
condition same as the case of joining relations with different
partition bounds and not use partition wise join for this join. This
case may be further improved in the next set of patches by trying to
merge partition bounds so that partition-wise join can be applied.
Here's set of patches which fixes the issue.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v19.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v19.tar.gz
#125Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#122)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Apr 24, 2017 at 7:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

This assumes that datums in partition bounds have one to one mapping
with the partitions, which isn't true for list partitions. For list
partitions we have multiple datums corresponding to the items listed
associated with a given partition. So, simply looping over the
partitions of outer relations doesn't work; in fact there are two
outer relations for a full outer join, so we have to loop over both of
them together in a merge join fashion.

Maybe so, but my point is that it can be done with the original types,
without converting anything to a different type.

When using clause is used the columns specified by using clause from
the joining relations are merged into a single column. Here it has
used a "wider" type column t2.a as the merged column for t1.a and
t2.a. The logic is in buildMergedJoinVar().

That relies on select_common_type(), which can error out if it can't
find a common type. That's OK for the current uses of that function,
because if it fails it means that the query is invalid. But it's not
OK for what you want here, because it's not OK to error out due to
inability to do a partition-wise join when a non-partition-wise join
would have worked. Also, note that all select_common_type() is really
doing is looking for the type within the type category that is marked
typispreferred, or else checking which direction has an implicit cast.
Neither of those things guarantee the property you want here, namely
that the "common" type is in the same opfamily and can store every
value of any of the input types without loss of precision. So I don't
think you can rely on that.

I'm going to say this one more time: I really, really, really think
you need to avoid trying to convert the partition bounds to a common
type. I said before that the infrastructure to do that is not present
in our type system, and I'm pretty sure that statement is 100%
correct. The fact that you can find other cases where we do something
sorta like that but in a different case with different requirements
doesn't make that false.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#126Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#125)
Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas <robertmhaas@gmail.com> writes:

I'm going to say this one more time: I really, really, really think
you need to avoid trying to convert the partition bounds to a common
type. I said before that the infrastructure to do that is not present
in our type system, and I'm pretty sure that statement is 100%
correct. The fact that you can find other cases where we do something
sorta like that but in a different case with different requirements
doesn't make that false.

It's not just a matter of lack of infrastructure: the very attempt is
flawed, because in some cases there simply isn't a supertype that can
hold all values of both types. An easy counterexample is float8 vs
numeric: you can't convert float8 'Infinity' to numeric, but also there
are values of numeric that can't be converted to float8 without overflow
and/or loss of precision.

The whole business of precision loss makes things very touchy for almost
anything involving float and a non-float type, actually.

What I'm going to ask one more time, though, is why we are even discussing
this. Surely the partition bounds of a partitioned table must all be of
the same type already. If there is a case where they are not, that is
a bug we had better close off before v10 ships, not a feature that we
need to write a lot of code to accommodate.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#127Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#126)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Apr 26, 2017 at 12:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

What I'm going to ask one more time, though, is why we are even discussing
this. Surely the partition bounds of a partitioned table must all be of
the same type already. If there is a case where they are not, that is
a bug we had better close off before v10 ships, not a feature that we
need to write a lot of code to accommodate.

This question was answered before, by Ashutosh.

/messages/by-id/CAFjFpRfaKSO4YZjVv7jkcMEMVgDcnqc4yhqVWhO5gczB5mW8eQ@mail.gmail.com

Since you either didn't read his answer, or else didn't understand it
and didn't bother asking for clarification, I'll try to be more blunt:
of course all of the partition bounds of a single partitioned table
have to be of the same type. We're not talking about that, because no
kidding. This thread is about the possibility -- in a future release
-- of implementing a join between two different partitioned tables by
joining each pair of matching partitions. To do that, you need the
tables to be compatibly partitioned, which requires that the
partitioning columns use the same opfamily for each partitioning
column but not necessarily that the types be the same. Making
partition-wise join work in the case where the partitioning columns
are of different types within an opfamily (like int4 vs. int8) is
giving Ashutosh a bit of trouble. So this is about a cross-type join,
not multiple types within a single partitioning hierarchy, as you
might also gather from the subject line of this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#128Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#127)
Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas <robertmhaas@gmail.com> writes:

So this is about a cross-type join,
not multiple types within a single partitioning hierarchy, as you
might also gather from the subject line of this thread.

OK, but I still don't understand why any type conversion is needed
in such a case. The existing join estimators don't try to do that,
for the good and sufficient reasons you and I have already mentioned.
They just apply the given cross-type join operator, and whatever
cross-type selectivity estimator might be associated with it, and
possibly other cross-type operators obtained from the same btree
opfamily.

The minute you get into trying to do any type conversion that is not
mandated by the semantics of the query as written, you're going to
have problems.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#129Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#128)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Apr 26, 2017 at 12:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

So this is about a cross-type join,
not multiple types within a single partitioning hierarchy, as you
might also gather from the subject line of this thread.

OK, but I still don't understand why any type conversion is needed
in such a case. The existing join estimators don't try to do that,
for the good and sufficient reasons you and I have already mentioned.
They just apply the given cross-type join operator, and whatever
cross-type selectivity estimator might be associated with it, and
possibly other cross-type operators obtained from the same btree
opfamily.

The minute you get into trying to do any type conversion that is not
mandated by the semantics of the query as written, you're going to
have problems.

There is no daylight whatsoever between us on this issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#130Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#125)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Apr 26, 2017 at 9:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 24, 2017 at 7:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

This assumes that datums in partition bounds have one to one mapping
with the partitions, which isn't true for list partitions. For list
partitions we have multiple datums corresponding to the items listed
associated with a given partition. So, simply looping over the
partitions of outer relations doesn't work; in fact there are two
outer relations for a full outer join, so we have to loop over both of
them together in a merge join fashion.

Maybe so, but my point is that it can be done with the original types,
without converting anything to a different type.

Theoretically, I agree with this. But practically the implementation
is lot more complex than what you have described in the earlier mails.
I am afraid, that the patch with those changes will be a lot harder to
review and commit. Later in this mail, I will try to explain some of
the complexities.

I'm going to say this one more time: I really, really, really think
you need to avoid trying to convert the partition bounds to a common
type. I said before that the infrastructure to do that is not present
in our type system, and I'm pretty sure that statement is 100%
correct. The fact that you can find other cases where we do something
sorta like that but in a different case with different requirements
doesn't make that false.

Ok. Thanks for the explanation.

The current design and implementation is for a restricted case where
the partition bounds, partition key types and numbers match exactly.
We want to commit an implementation which is reasonably extensible and
doesn't require a lot of changes when we add more capabilities. Some
of the extensions we discussed are as follows:
1. Partition-wise join when the existing partitions have matching
bounds/lists but there can be extra partitions on either side of the
join (between base relations or join relations) without a matching
partition on the other side.\
2. Partition-wise join when the partition bounds/lists do not match
exactly but there is 1:1 or 1:0 or 0:1 mapping between the partitions
which can contribute to the final result. E.g. A (0-100, 100 - 150,
200-300), B (0-50, 125-200, 300-400)
3. Partition-wise join when the partition key types do not match, but
there's a single opfamily being used for partitioning.
4. Partition-wise join where 1:m or m:n mapping exists between
partitions of the joining relations.

First one is clearly something that we will need. We may add it in the
first commit or next commit, but it will be needed pretty soon (v11?).
To me 2nd is more important than the 3rd one. You may have a different
view. We will expect 3rd optimization to work with all the prior
optimizations. I am restricting myself from thinking about 4th one
since that requires ganging together multiple RelOptInfos as a single
RelOptInfo while joining, something we don't have infrastruture for.

In case of first goal, supporting INNER joins and OUTER joins where
the partitions are missing on the OUTER side but not inner side are
easier. In those cases we just drop those partitions and corresponding
bounds/lists from the join. For a FULL OUTER join, where both sides
act as OUTER as well as INNER, we will need exact mapping between the
partitions. For supporting OUTER joins where partitions on the INNER
sides can be missing, we need to create some "dummy" relations
representing the missing partitions so that we have OUTER rows with
NULL inner side. This requires giving those dummy relations some
relids and thus in case of base relations we may need to inject some
dummy children. This may mean that we have to expand simple_rel_array
as part of outer join, may or may not require adding new
AppendRelInfos and so on. We are basically breaking an assumption that
base relations can not be introduced while planning joins and that
might require some rework in the existing infrastructure. There might
be other ways to introduce dummy relations during join planning, but I
haven't really thought through the problem.

The third goal requires that the partition bounds be compared based on
the partition keys present in the equi-join. While matching the
partitions to be joined, the partition bounds corresponding the base
relation whose partition keys appear in the equi-join are used for
comparison using support function corresponding to the data types of
partition keys. This requires us to associate the partitions of a join
with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3)
(notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds
(X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c
assuming strict operators this can be executed as (AB)C or A(BC). AB
will have partitions A1B1, A2B3 since there is no matching bound of A
for B2 and A is outer relation. A1B1 is associated with bound X1 of A
and C both. A2B3 is associated with bound of X3, which happens to be
2nd bound of A but third of B. When we join (AB) with C, we should
notice that C1 goes with A1B1, C2 doesn't have any matching partition
in AB and C3 goes with A2B3. If we compare bounds of B with C without
any transformation we will know C2 matches B2, but we need to look at
the children of AB to realize that B2 isn't present in any of the
children and thus C2 should not be joined with any partition of AB.
That usually looks a quadratic order operation on the number of
partitions. The complexity can be reduced by maintaining as many
partition bounds as the number of base relations participating in the
join (an idea, I have floated earlier [1]http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312916.html) I don't elaborate it here
to avoid digression. There's also the complexity of an N-way join with
multiple partition keys and joins on partition keys from different
relations as discussed in [1]http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312916.html. There may be more involved cases, that
I haven't thought about. In short, implementation for 1st and 3rd
optimization together looks fairly complex.

Add to this the 2nd optimization and it becomes still more complex.

In order to keep the patches manageable to implement review and
commit, I am proposing following approach.

1. Implement first optimization on top of the current patches, which
enforces that the partition key datatypes of the joining relations
match. I am right now working on that patch. Do this for INNER join
and OUTER join where partitions are missing on the OUTER side and not
INNER side.
As a side note, the existing partition bound comparison functions are
tied to PartitionKey structure and require complete set of bounds from
partitioned relation. Both of those are not applicable anymore,
PartitionKey structure is not available for join and we have to
compare individual bounds in case of join as against one probe with a
complete set. This refactoring did eat some time.

2. Implement support for OUTER join where partitions can be missing
from either side.

3. Implement support for partition-wise join with different partition key types.

All those implementation will be different patches on top of v18 patches.

Given the complexities involved in 2 and 3, I am not sure which order
I should attack them. I don't have any estimates as to how much time
each of those are going to require. May be a couple of months, but I
am not sure.

Obviously we have to wait till the first commitfest to commit the
first version of the patch. So, based on the status at time, we can
decide what goes in the first commit of this feature and adjust the
patch set accordingly.

Thoughts/comments?

[1]: http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312916.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#131Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#130)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Apr 27, 2017 at 3:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The third goal requires that the partition bounds be compared based on
the partition keys present in the equi-join. While matching the
partitions to be joined, the partition bounds corresponding the base
relation whose partition keys appear in the equi-join are used for
comparison using support function corresponding to the data types of
partition keys. This requires us to associate the partitions of a join
with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3)
(notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds
(X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c
assuming strict operators this can be executed as (AB)C or A(BC). AB
will have partitions A1B1, A2B3 since there is no matching bound of A
for B2 and A is outer relation. A1B1 is associated with bound X1 of A
and C both. A2B3 is associated with bound of X3, which happens to be
2nd bound of A but third of B. When we join (AB) with C, we should
notice that C1 goes with A1B1, C2 doesn't have any matching partition
in AB and C3 goes with A2B3. If we compare bounds of B with C without
any transformation we will know C2 matches B2, but we need to look at
the children of AB to realize that B2 isn't present in any of the
children and thus C2 should not be joined with any partition of AB.

Sure.

That usually looks a quadratic order operation on the number of
partitions.

Now that I don't buy. Certainly, for range partitions, given a list
of ranges of length M and another of length N, this can be done in
O(M+N) time by merge-joining the lists of bounds. You pointed out
upthread that for list partitions, things are a bit complicated
because a single list partition can contain multiple values which are
not necessarily contiguous, but I think that this can still be done in
O(M+N) time. Sort all of the bounds, associating each one to a
partition, and do a merge pass; whenever two bounds match, match the
two corresponding partitions, but if one of those partitions is
already matched to some other partition, then fail.

For example, consider A1 FOR VALUES IN (1,3,5), A2 FOR VALUES IN
(2,4,6), B1 FOR VALUES IN (1,6), B2 FOR VALUES IN (2,4). The sorted
bounds for A are 1,2,3,4,5,6; for B, 1,2,4,6. The first value in both
lists is a 1, so the corresponding partitions A1 and B1 are matched.
The second value in both lists is a 2, so the corresponding partitions
A2 and B2 are matched. Then we hit a 3 on the A side that has no
match on the B side, but that's fine; we don't need to do anything.
If the partition on the A side never got a mapping at any point during
this merge pass, we'd eventually need to match it to a dummy partition
(unless this is an inner join) but it's already mapped to B1 so no
problem. Then we hit a 4 which says that A2 must match B2, which is
consistent with what we already determine; no problem. Then we hit
another value that only exists on the A side, which is fine just as
before. Finally we hit a 6 on each side, which means that A2 must
match B1, which is inconsistent with the existing mappings so we give
up; no partitionwise join is possible here.

The complexity can be reduced by maintaining as many
partition bounds as the number of base relations participating in the
join (an idea, I have floated earlier [1]) I don't elaborate it here
to avoid digression. There's also the complexity of an N-way join with
multiple partition keys and joins on partition keys from different
relations as discussed in [1]. There may be more involved cases, that
I haven't thought about. In short, implementation for 1st and 3rd
optimization together looks fairly complex.

I spent some time thinking about this today and I think I see how we
could make it work: keep a single set of bounds for each join
relation, but record the type of each bound. For example, suppose we
are full joining relation i2, with an int2 partition column, which has
partitions i2a from 0 to 10000 and i2b from 20000 to 30000, to
relation i4, with an int4 partition column, which has partitions i4a
from 5000 to 15000 and i4b from 25000 to 35000. We end up with a
joinrel with 2 partitions. The first goes from 0 (stored as an int2)
to 15000 (stored as an int4) and the second goes from 20000 (stored as
an int2) to 35000 (stored as an int4). If we subsequently need to
merge these bounds with yet another relation at a higher join level,
we can use the opfamily (which is common) to dig out the right
cross-type operator for each comparison we may need to perform, based
on the precise types of the datums being compared. Of course, we
might not find an appropriate cross-type operator in some cases,
because an opfamily isn't required to provide that, so then we'd have
to fail gracefully somehow, but that could be done.

Having said that I think we could make this work, I'm starting to
agree with you that it will add more complexity than it's worth.
Needing to keep track of the type of every partition bound
individually seems like a real nuisance, and it's not likely to win
very often because, realistically, people should and generally will
use the same type for the partitioning column in all of the relevant
tables. So I'm going to revise my position and say it's fine to just
give up on partitionwise join unless the types match exactly, but I
still think we should try to cover the cases where the bounds don't
match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
optimizations 1 and 2 from your list of 4). I agree that ganging
partitions (optimization 4 from your list) is not something to tackle
right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#132Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#131)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Apr 28, 2017 at 1:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 27, 2017 at 3:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The third goal requires that the partition bounds be compared based on
the partition keys present in the equi-join. While matching the
partitions to be joined, the partition bounds corresponding the base
relation whose partition keys appear in the equi-join are used for
comparison using support function corresponding to the data types of
partition keys. This requires us to associate the partitions of a join
with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3)
(notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds
(X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c
assuming strict operators this can be executed as (AB)C or A(BC). AB
will have partitions A1B1, A2B3 since there is no matching bound of A
for B2 and A is outer relation. A1B1 is associated with bound X1 of A
and C both. A2B3 is associated with bound of X3, which happens to be
2nd bound of A but third of B. When we join (AB) with C, we should
notice that C1 goes with A1B1, C2 doesn't have any matching partition
in AB and C3 goes with A2B3. If we compare bounds of B with C without
any transformation we will know C2 matches B2, but we need to look at
the children of AB to realize that B2 isn't present in any of the
children and thus C2 should not be joined with any partition of AB.

Sure.

That usually looks a quadratic order operation on the number of
partitions.

Now that I don't buy. Certainly, for range partitions, given a list
of ranges of length M and another of length N, this can be done in
O(M+N) time by merge-joining the lists of bounds. You pointed out
upthread that for list partitions, things are a bit complicated
because a single list partition can contain multiple values which are
not necessarily contiguous, but I think that this can still be done in
O(M+N) time. Sort all of the bounds, associating each one to a
partition, and do a merge pass; whenever two bounds match, match the
two corresponding partitions, but if one of those partitions is
already matched to some other partition, then fail.

For example, consider A1 FOR VALUES IN (1,3,5), A2 FOR VALUES IN
(2,4,6), B1 FOR VALUES IN (1,6), B2 FOR VALUES IN (2,4). The sorted
bounds for A are 1,2,3,4,5,6; for B, 1,2,4,6. The first value in both
lists is a 1, so the corresponding partitions A1 and B1 are matched.
The second value in both lists is a 2, so the corresponding partitions
A2 and B2 are matched. Then we hit a 3 on the A side that has no
match on the B side, but that's fine; we don't need to do anything.
If the partition on the A side never got a mapping at any point during
this merge pass, we'd eventually need to match it to a dummy partition
(unless this is an inner join) but it's already mapped to B1 so no
problem. Then we hit a 4 which says that A2 must match B2, which is
consistent with what we already determine; no problem. Then we hit
another value that only exists on the A side, which is fine just as
before. Finally we hit a 6 on each side, which means that A2 must
match B1, which is inconsistent with the existing mappings so we give
up; no partitionwise join is possible here.

For two-way join this works and is fairly straight-forward. I am
assuming that A an B are base relations and not joins. But making it
work for N-way join is the challenge. I don't see your example
describing that. But I think, given your revised position below, we
don't need to get this right at this point. Remember, that the
paragraph was about 3rd goal, which according to your revised position
is now deferred.

Having said that I think we could make this work, I'm starting to
agree with you that it will add more complexity than it's worth.
Needing to keep track of the type of every partition bound
individually seems like a real nuisance, and it's not likely to win
very often because, realistically, people should and generally will
use the same type for the partitioning column in all of the relevant
tables. So I'm going to revise my position and say it's fine to just
give up on partitionwise join unless the types match exactly, but I
still think we should try to cover the cases where the bounds don't
match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
optimizations 1 and 2 from your list of 4). I agree that ganging
partitions (optimization 4 from your list) is not something to tackle
right now.

Good. I will have a more enjoyable vacation now.

Do you still want the patition key type to be out of partition scheme?
Keeping it there means we match it only once and save it only at a
single place. Otherwise, it will have to be stored in RelOptInfo of
the partitioned table and match it for every pair of joining
relations.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#133Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#132)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Apr 28, 2017 at 1:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

For two-way join this works and is fairly straight-forward. I am
assuming that A an B are base relations and not joins. But making it
work for N-way join is the challenge.

I don't think it's much different, is it? Anyway, I'm going to
protest if your algorithm for merging bounds takes any more than
linear time, regardless of what else we decide.

Having said that I think we could make this work, I'm starting to
agree with you that it will add more complexity than it's worth.
Needing to keep track of the type of every partition bound
individually seems like a real nuisance, and it's not likely to win
very often because, realistically, people should and generally will
use the same type for the partitioning column in all of the relevant
tables. So I'm going to revise my position and say it's fine to just
give up on partitionwise join unless the types match exactly, but I
still think we should try to cover the cases where the bounds don't
match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
optimizations 1 and 2 from your list of 4). I agree that ganging
partitions (optimization 4 from your list) is not something to tackle
right now.

Good. I will have a more enjoyable vacation now.

Phew, what a relief. :-)

Do you still want the patition key type to be out of partition scheme?
Keeping it there means we match it only once and save it only at a
single place. Otherwise, it will have to be stored in RelOptInfo of
the partitioned table and match it for every pair of joining
relations.

The only reason for removing things from the PartitionScheme was if
they didn't need to be consistent across all tables. Deciding that
the type is one of the things that has to match means deciding it
should be in the PartitionScheme, not the RelOptInfo.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#134Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#108)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:

There's a relevant comment in 0006, build_joinrel_partition_info()
(probably that name needs to change, but I will do that once we have
settled on design)
+   /*
+    * Construct partition keys for the join.
+    *
+    * An INNER join between two partitioned relations is partition by key
+    * expressions from both the relations. For tables A and B
partitioned by a and b
+    * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
+    * and B.b.
+    *
+    * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+    * B.b NULL. These rows may not fit the partitioning conditions imposed on
+    * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
+    * Strictly speaking, partition keys of an OUTER join should include
+    * partition key expressions from the OUTER side only. Consider a join like
+    * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
+    * include B.b as partition key expression for (AB), it prohibits us from
+    * using partition-wise join when joining (AB) with C as there is no
+    * equi-join between partition keys of joining relations. But two NULL
+    * values are never equal and no two rows from mis-matching partitions can
+    * join. Hence it's safe to include B.b as partition key expression for
+    * (AB), even though rows in (AB) are not strictly partitioned by B.b.
+    */

I think that also needs to be reviewed carefully.

The following passage from src/backend/optimizer/README seems highly relevant:

===
The planner's treatment of outer join reordering is based on the following
identities:

1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
= (A innerjoin C on (Pac)) leftjoin B on (Pab)

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
= (A leftjoin C on (Pac)) leftjoin B on (Pab)

3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
= A leftjoin (B leftjoin C on (Pbc)) on (Pab)

Identity 3 only holds if predicate Pbc must fail for all-null B rows
(that is, Pbc is strict for at least one column of B). If Pbc is not
strict, the first form might produce some rows with nonnull C columns
where the second form would make those entries null.
===

In other words, I think your statement that null is never equal to
null is a bit imprecise. Somebody could certainly create an operator
that is named "=" which returns true in that case, and then they could
say, hey, two nulls are equal (when you use that operator). The
argument needs to be made in terms of the formal properties of the
operator.

[.. some portion clipped .. ]

The relevant logic is in have_partkey_equi_join:

+               /* Skip clauses which are not equality conditions. */
+               if (rinfo->hashjoinoperator == InvalidOid &&
!rinfo->mergeopfamilies)
+                       continue;

Actually, I think the hashjoinoperator test is formally and
practically unnecessary here; lower down there is a test to see
whether the partitioning scheme's operator family is a member of
rinfo->mergeopfamilies, which will certainly fail if we got through
this test with rinfo->mergeopfamilies == NIL just on the strength of
rinfo->hashjoinoperator != InvalidOid. So you can just bail out if
rinfo->mergeopfamilies == NIL. But the underlying point here is that
the only thing you really know about the function is that it's got to
be a strategy-3 operator in some btree opclass; if that guarantees
strictness, then so be it -- but I wasn't able to find anything in the
code or documentation off-hand that supports that contention, so we
might need to think a bit more about why (or if) this is guaranteed to
be true.

Partition-wise joins
may be happy including partition keys from all sides, but
partition-wise aggregates may not be, esp. when pushing complete
aggregation down to partitions. In that case, rows with NULL partition
key, which falls on nullable side of join, will be spread across
multiple partitions. Proabably, we should separate nullable and
non-nullable partition key expressions.

I don't think I understand quite what you're getting at here. Can you
spell this out in more detail? To push an aggregate down to
partitions, you need the grouping key to match the applicable
partition key, and the partition key shouldn't allow nulls in more
than one place. Now I think your point may be that outer join
semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
good test case for partitionwise aggregate. I'd be inclined to think
that we should just give up on partitionwise aggregate in such cases;
it's not worth trying to optimize such a weird query, at least IMHO.
(Does this sort of case ever happen with joins? I think not, as long
as the join operator is strict.)

I am revisiting NULL equality in the context of merging partition
bounds. In [1]https://www.postgresql.org/docs/devel/static/functions-comparison.html paragraphs following

--
Do not write expression = NULL because NULL is not “equal to” NULL.
(The null value represents an unknown value, and it is not known
whether two unknown values are equal.)

--

seem to indicate that an equality operator should never return true
for two NULL values since it would never know whether two NULL
(unknown) values are same or not. In a paragraph above, Robert stated
that

In other words, I think your statement that null is never equal to
null is a bit imprecise. Somebody could certainly create an operator
that is named "=" which returns true in that case, and then they could
say, hey, two nulls are equal (when you use that operator). The
argument needs to be made in terms of the formal properties of the
operator.

But in case a user has written an = operator which returns true for
two NULL values, per description in [1]https://www.postgresql.org/docs/devel/static/functions-comparison.html, that comparison operator is
flawed and
using that operator is going to result in SQL-standard-incompliant
behaviour. I have tried to preserve all the relevant portions of
discussion in this mail. Am I missing something?

[1]: https://www.postgresql.org/docs/devel/static/functions-comparison.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#135Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#134)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, May 18, 2017 at 4:38 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

But in case a user has written an = operator which returns true for
two NULL values, per description in [1], that comparison operator is
flawed and
using that operator is going to result in SQL-standard-incompliant
behaviour. I have tried to preserve all the relevant portions of
discussion in this mail. Am I missing something?

Yes. You're confusing friendly advice about how to write good SQL
with internals documentation about how the system actually works. The
documentation we have about how operator classes and index methods and
so forth actually work under the hood is in
https://www.postgresql.org/docs/devel/static/xindex.html -- as a
developer, that's what you should be looking at.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#136Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#133)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Apr 29, 2017 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 28, 2017 at 1:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

For two-way join this works and is fairly straight-forward. I am
assuming that A an B are base relations and not joins. But making it
work for N-way join is the challenge.

I don't think it's much different, is it? Anyway, I'm going to
protest if your algorithm for merging bounds takes any more than
linear time, regardless of what else we decide.

Having said that I think we could make this work, I'm starting to
agree with you that it will add more complexity than it's worth.
Needing to keep track of the type of every partition bound
individually seems like a real nuisance, and it's not likely to win
very often because, realistically, people should and generally will
use the same type for the partitioning column in all of the relevant
tables. So I'm going to revise my position and say it's fine to just
give up on partitionwise join unless the types match exactly, but I
still think we should try to cover the cases where the bounds don't
match exactly but only 1:1 or 1:0 or 0:1 mappings are needed (iow,
optimizations 1 and 2 from your list of 4). I agree that ganging
partitions (optimization 4 from your list) is not something to tackle
right now.

Good. I will have a more enjoyable vacation now.

Phew, what a relief. :-)

Do you still want the patition key type to be out of partition scheme?
Keeping it there means we match it only once and save it only at a
single place. Otherwise, it will have to be stored in RelOptInfo of
the partitioned table and match it for every pair of joining
relations.

The only reason for removing things from the PartitionScheme was if
they didn't need to be consistent across all tables. Deciding that
the type is one of the things that has to match means deciding it
should be in the PartitionScheme, not the RelOptInfo.

Here's set of patches rebased on latest head.

I spent some time trying to implement partition-wise join when
partition bounds do not match exactly but there's 1:1, 1:0 or 0:1
mapping between partitions. A WIP patch 0017 is included in the set
for the same. The patch is not complete, it doesn't support range
partitions and needs some bugs to be fixed for list partitions. Also
because of the way it crafts partition bounds for a join, it leaks
memory consumed by partition bounds for every pair of joining
relations. I will work on fixing those issues. That patch is pretty
large now. So, I think we will have to commit it separately on top of
basic partition-wise join implementation. But you will see that it has
minimal changes to the basic partition-wise join code.

I rewrote code handling partition keys on the nullable side of the
join. Now we store partition keys from nullable and non-nullable
relations separately. The partition keys from nullable relations are
matched only when the equality operator is strict. This is explained
in details the comments in match_expr_to_partition_keys() and
build_joinrel_partition_info().

Also please note that since last patch set I have paired the
multi-level partition-wise join support patches with single-level
partition-wise join patches providing corresponding functionality.

[1]: /messages/by-id/CAFjFpRd9ebX225KhuvYXQRBuk9NrVJfPzHqGPGqpze+qvH0xmw@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v20.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v20.tar.gz
#137Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#136)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

Here's set of patches rebased on latest head.

In an attempt to test this set of patches, I found that not all of the
patches could be applied on latest head-- commit
08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
Might be in need of rebasing.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

#138Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#137)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's set of patches rebased on latest head.

In an attempt to test this set of patches, I found that not all of the
patches could be applied on latest head-- commit
08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
Might be in need of rebasing.

Thanks Rafia for your interest. I have started rebasing the patches on
the latest head. I am expecting it to take some time. Will update the
thread with the patches once I am done rebasing them.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#139Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#138)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Jul 4, 2017 at 10:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's set of patches rebased on latest head.

In an attempt to test this set of patches, I found that not all of the
patches could be applied on latest head-- commit
08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
Might be in need of rebasing.

Thanks Rafia for your interest. I have started rebasing the patches on
the latest head. I am expecting it to take some time. Will update the
thread with the patches once I am done rebasing them.

Here are patches rebased.

As mentioned in my previous mail [1]/messages/by-id/CAFjFpRdF8GpmSjjn0fm85cMW2iz+r3MQJQ_HC0eDATzWSv5buw@mail.gmail.com, the last two patches are not
complete but are included, so that the reviewer can see the changes we
will have to make when we go towards more general partition-wise join.
Please use patches upto 0015, which implement 1:1 partition mapping
for benchmarking and testing.

[1]: /messages/by-id/CAFjFpRdF8GpmSjjn0fm85cMW2iz+r3MQJQ_HC0eDATzWSv5buw@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v21.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v21.tar.gz
#140Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#139)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Jul 10, 2017 at 3:57 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Jul 4, 2017 at 10:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Jun 30, 2017 at 2:53 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Mon, May 22, 2017 at 12:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's set of patches rebased on latest head.

In an attempt to test this set of patches, I found that not all of the
patches could be applied on latest head-- commit
08aed6604de2e6a9f4d499818d7c641cbf5eb9f7
Might be in need of rebasing.

Thanks Rafia for your interest. I have started rebasing the patches on
the latest head. I am expecting it to take some time. Will update the
thread with the patches once I am done rebasing them.

Here are patches rebased.

As mentioned in my previous mail [1], the last two patches are not
complete but are included, so that the reviewer can see the changes we
will have to make when we go towards more general partition-wise join.
Please use patches upto 0015, which implement 1:1 partition mapping
for benchmarking and testing.

[1] /messages/by-id/CAFjFpRdF8GpmSjjn0fm85cMW2iz+r3MQJQ_HC0eDATzWSv5buw@mail.gmail.com

Here's revised patch set with only 0004 revised. That patch deals with
creating multi-level inheritance hierarchy from multi-level partition
hierarchy. The original logic of recursively calling
inheritance_planner()'s guts over the inheritance hierarchy required
that for every such recursion we flatten many lists created by that
code. Recursion also meant that root->append_rel_list is traversed as
many times as the number of partitioned partitions in the hierarchy.
Instead the revised version keep the iterative shape of
inheritance_planner() intact, thus naturally creating flat lists,
iterates over root->append_rel_list only once and is still easy to
read and maintain.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v22.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v22.tar.gz
#141Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#140)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Jul 14, 2017 at 12:32 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

Here's revised patch set with only 0004 revised. That patch deals with
creating multi-level inheritance hierarchy from multi-level partition
hierarchy. The original logic of recursively calling
inheritance_planner()'s guts over the inheritance hierarchy required
that for every such recursion we flatten many lists created by that
code. Recursion also meant that root->append_rel_list is traversed as
many times as the number of partitioned partitions in the hierarchy.
Instead the revised version keep the iterative shape of
inheritance_planner() intact, thus naturally creating flat lists,
iterates over root->append_rel_list only once and is still easy to
read and maintain.

On testing this patch for TPC-H (for scale factor 20) benchmark I found a
regression for Q21, on head it was taking some 600 seconds and with this
patch it is taking 3200 seconds. This comparison is on the same partitioned
database, one using the partition wise join patch and other is without it.
The execution time of Q21 on unpartitioned head is some 300 seconds. The
explain analyse output for each of these cases is attached.

This suggests that partitioning is not a suitable strategy for this query,
but then may be partition wise should not be picked for such a case to
aggravate the performance issue.

The details of the setup is as follows,

Server parameter settings,
work_mem - 1GB
effective_cache_size - 8GB
shared_buffers - 8GB
enable_partition_wise_join = on

Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders

Lineitem -
Partition key = l_orderkey
No of partitions = 18

Orders -
Partition key = o_orderkey
No of partitions = 11

Commit id - 42171e2cd23c8307bbe0ec64e901f58e297db1c3

I chose orderkey as the partition key since it is the primary key of orders
and along with l_linenumber it forms the primary key for lineitem.
For the above mentioned settings, there was no other query that used
partitioned wise join.

Please let me know if any more information is required regarding this
experimentation.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

part_reg.zipapplication/zip; name=part_reg.zip
#142Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Rafia Sabih (#141)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 19, 2017 at 12:24 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On testing this patch for TPC-H (for scale factor 20) benchmark I found a
regression for Q21, on head it was taking some 600 seconds and with this
patch it is taking 3200 seconds. This comparison is on the same partitioned
database, one using the partition wise join patch and other is without it.
The execution time of Q21 on unpartitioned head is some 300 seconds. The
explain analyse output for each of these cases is attached.

Interesting.

This suggests that partitioning is not a suitable strategy for this query,
but then may be partition wise should not be picked for such a case to
aggravate the performance issue.

In the unpartitioned case, and in the partitioned case on head, the
join order is l1-(nation-supplier)-l2-orders-l3. In the patched case,
the join order changes to l1-l2-supplier-orders-nation-l3. If the
planner used the former join order, it wouldn't be able to do a
partition-wise join at all, so it must think that the l1-l2 join gets
much cheaper when done partitionwise, thus justifying a change in the
overall join order to be able to use partion-wise join. But it
doesn't work out.

I think the problem is that the row count estimates for the child
joins seem to be totally bogus:

-> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12)
(actual time=10484.422..15945.851 rows=1523493 loops=3)
Hash Cond: (l1.l_orderkey = l2.l_orderkey)
Join Filter: (l2.l_suppkey <> l1.l_suppkey)
Rows Removed by Join Filter: 395116

That's clearly wrong. In the un-partitioned plan, the join to l2
produces about as many rows of output as the number of rows that were
input (998433 vs. 962909); but here, a child join with a million rows
as input is estimated to produce only 1 row of output. I bet the
problem is that the child-join's row count estimate isn't getting
initialized at all, but then something is clamping it to 1 row instead
of 0.

So this looks like a bug in Ashutosh's patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#143Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#142)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Jul 20, 2017 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think the problem is that the row count estimates for the child
joins seem to be totally bogus:

-> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12)
(actual time=10484.422..15945.851 rows=1523493 loops=3)
Hash Cond: (l1.l_orderkey = l2.l_orderkey)
Join Filter: (l2.l_suppkey <> l1.l_suppkey)
Rows Removed by Join Filter: 395116

That's clearly wrong. In the un-partitioned plan, the join to l2
produces about as many rows of output as the number of rows that were
input (998433 vs. 962909); but here, a child join with a million rows
as input is estimated to produce only 1 row of output. I bet the
problem is that the child-join's row count estimate isn't getting
initialized at all, but then something is clamping it to 1 row instead
of 0.

So this looks like a bug in Ashutosh's patch.

Isn't this the same as the issue reported here?

/messages/by-id/CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#144Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#143)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Jul 20, 2017 at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think the problem is that the row count estimates for the child
joins seem to be totally bogus:

-> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12)
(actual time=10484.422..15945.851 rows=1523493 loops=3)
Hash Cond: (l1.l_orderkey = l2.l_orderkey)
Join Filter: (l2.l_suppkey <> l1.l_suppkey)
Rows Removed by Join Filter: 395116

That's clearly wrong. In the un-partitioned plan, the join to l2
produces about as many rows of output as the number of rows that were
input (998433 vs. 962909); but here, a child join with a million rows
as input is estimated to produce only 1 row of output. I bet the
problem is that the child-join's row count estimate isn't getting
initialized at all, but then something is clamping it to 1 row instead
of 0.

So this looks like a bug in Ashutosh's patch.

Isn't this the same as the issue reported here?

/messages/by-id/CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com

Hmm, possibly. But why would that affect the partition-wise join case only?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#145Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#144)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Jul 20, 2017 at 2:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Isn't this the same as the issue reported here?

/messages/by-id/CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com

Hmm, possibly. But why would that affect the partition-wise join case only?

It doesn't. From Rafia's part_reg.zip we see a bunch of rows=1 that
turn out to be wrong by several orders of magnitude:

21_nopart_head.out: Hash Semi Join (cost=5720107.25..9442574.55
rows=1 width=50)
21_part_head.out: Hash Semi Join (cost=5423094.06..8847638.36
rows=1 width=38)
21_part_patched.out: Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12)

My guess is that the consequences of that bad estimate are sensitive
to arbitrary other parameters moving around, as you can see from the
big jump in execution time I showed in the that message, measured on
unpatched master of the day:

4 workers = 9.5s
3 workers = 39.7s

That's why why both parallel hash join and partition-wise join are
showing regressions on Q21: it's just flip-flopping between various
badly costed plans. Note that even without parallelism, the fix that
Tom Lane suggested gives a much better plan:

/messages/by-id/CAEepm=11BiYUkgXZNzMtYhXh4S3a9DwUP8O+F2_ZPeGzzJFPbw@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#146Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#141)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders

Lineitem -
Partition key = l_orderkey
No of partitions = 18

Orders -
Partition key = o_orderkey
No of partitions = 11

The patch set upto 0015 would refuse to join two partitioned relations
using a partition-wise join if they have different number of
partitions. Next patches implement a more advanced partition matching
algorithm only for list partitions. Those next patches would refuse to
apply partition-wise join for range partitioned tables. So, I am
confused as to how come partition-wise join is being chosen even when
the number of partitions differ.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#147Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#146)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/07/20 15:05, Ashutosh Bapat wrote:

On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders

Lineitem -
Partition key = l_orderkey
No of partitions = 18

Orders -
Partition key = o_orderkey
No of partitions = 11

The patch set upto 0015 would refuse to join two partitioned relations
using a partition-wise join if they have different number of
partitions. Next patches implement a more advanced partition matching
algorithm only for list partitions. Those next patches would refuse to
apply partition-wise join for range partitioned tables. So, I am
confused as to how come partition-wise join is being chosen even when
the number of partitions differ.

In 21_part_patched.out, I see that lineitem is partitionwise-joined with
itself.

Append

-> Hash Semi Join
Hash Cond: (l1.l_orderkey = l2.l_orderkey)
Join Filter: (l2.l_suppkey <> l1.l_suppkey)
Rows Removed by Join Filter: 395116

-> Parallel Seq Scan on lineitem_001 l1
Filter: (l_receiptdate > l_commitdate)
Rows Removed by Filter: 919654

-> Hash
Buckets: 8388608 Batches: 1 Memory Usage: 358464kB
-> Seq Scan on lineitem_001 l2

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#148Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#142)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Jul 20, 2017 at 12:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:

This suggests that partitioning is not a suitable strategy for this query,
but then may be partition wise should not be picked for such a case to
aggravate the performance issue.

In the unpartitioned case, and in the partitioned case on head, the
join order is l1-(nation-supplier)-l2-orders-l3. In the patched case,
the join order changes to l1-l2-supplier-orders-nation-l3. If the
planner used the former join order, it wouldn't be able to do a
partition-wise join at all, so it must think that the l1-l2 join gets
much cheaper when done partitionwise, thus justifying a change in the
overall join order to be able to use partion-wise join. But it
doesn't work out.

I think the problem is that the row count estimates for the child
joins seem to be totally bogus:

-> Hash Semi Join (cost=309300.53..491665.60 rows=1 width=12)
(actual time=10484.422..15945.851 rows=1523493 loops=3)
Hash Cond: (l1.l_orderkey = l2.l_orderkey)
Join Filter: (l2.l_suppkey <> l1.l_suppkey)
Rows Removed by Join Filter: 395116

That's clearly wrong. In the un-partitioned plan, the join to l2
produces about as many rows of output as the number of rows that were
input (998433 vs. 962909); but here, a child join with a million rows
as input is estimated to produce only 1 row of output. I bet the
problem is that the child-join's row count estimate isn't getting
initialized at all, but then something is clamping it to 1 row instead
of 0.

So this looks like a bug in Ashutosh's patch.

The patch does not have any changes to the selectivity estimation. It
might happen that some correction in selectivity estimation for
child-joins is required, but I have not spotted any code in
selectivity estimation that differentiates explicitly between child
and parent Vars and estimates. So, I am more inclined to believe
Thomas's theory. I will try Tom's suggested approach.

I am investigating this case with the setup that Rafia provided.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#149Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#147)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Jul 20, 2017 at 11:46 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/07/20 15:05, Ashutosh Bapat wrote:

On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders

Lineitem -
Partition key = l_orderkey
No of partitions = 18

Orders -
Partition key = o_orderkey
No of partitions = 11

The patch set upto 0015 would refuse to join two partitioned relations
using a partition-wise join if they have different number of
partitions. Next patches implement a more advanced partition matching
algorithm only for list partitions. Those next patches would refuse to
apply partition-wise join for range partitioned tables. So, I am
confused as to how come partition-wise join is being chosen even when
the number of partitions differ.

In 21_part_patched.out, I see that lineitem is partitionwise-joined with
itself.

Append

-> Hash Semi Join
Hash Cond: (l1.l_orderkey = l2.l_orderkey)
Join Filter: (l2.l_suppkey <> l1.l_suppkey)
Rows Removed by Join Filter: 395116

-> Parallel Seq Scan on lineitem_001 l1
Filter: (l_receiptdate > l_commitdate)
Rows Removed by Filter: 919654

-> Hash
Buckets: 8388608 Batches: 1 Memory Usage: 358464kB
-> Seq Scan on lineitem_001 l2

Ah, I see now.

We need the same number of partitions in all partitioned tables, for
joins to pick up partition-wise join.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#150Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Thomas Munro (#145)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Jul 20, 2017 at 8:53 AM, Thomas Munro <thomas.munro@enterprisedb.com

wrote:

On Thu, Jul 20, 2017 at 2:02 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Wed, Jul 19, 2017 at 7:45 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Isn't this the same as the issue reported here?

/messages/by-id/CAEepm=270ze2hVxWkJw-

5eKzc3AB4C9KpH3L2kih75R5pdSogg%40mail.gmail.com

Hmm, possibly. But why would that affect the partition-wise join case

only?

It doesn't. From Rafia's part_reg.zip we see a bunch of rows=1 that
turn out to be wrong by several orders of magnitude:

21_nopart_head.out: Hash Semi Join (cost=5720107.25..9442574.55
rows=1 width=50)
21_part_head.out: Hash Semi Join (cost=5423094.06..8847638.36
rows=1 width=38)
21_part_patched.out: Hash Semi Join (cost=309300.53..491665.60 rows=1
width=12)

My guess is that the consequences of that bad estimate are sensitive
to arbitrary other parameters moving around, as you can see from the
big jump in execution time I showed in the that message, measured on
unpatched master of the day:

4 workers = 9.5s
3 workers = 39.7s

That's why why both parallel hash join and partition-wise join are
showing regressions on Q21: it's just flip-flopping between various
badly costed plans. Note that even without parallelism, the fix that
Tom Lane suggested gives a much better plan:

/messages/by-id/CAEepm%25
3D11BiYUkgXZNzMtYhXh4S3a9DwUP8O%2BF2_ZPeGzzJFPbw%40mail.gmail.com

Following the discussion at [1]/messages/by-id/CAEepm=3=NHHko3oOzpik+ggLy17AO+px3rGYrg3x_x05+Br9-A@mail.gmail.com, with the patch Thomas posted there, now
Q21 completes in some 160 seconds. The plan is changed for the good but
does not use partition-wise join. The output of explain analyse is
attached.

Not just the join orders but the join strategy itself changed, with the
patch no hash semi join is picked which was consuming most time there,
rather nested loop semi join is in picture now, though the estimates are
still way-off, but the change in join-order made them terrible from
horrible. It appears like this query is performing efficient now
particularly because of worse under-estimated hash-join as compared to
under-estimated nested loop join.

For the hash-semi-join:
-> Hash (cost=3449457.34..3449457.34 rows=119994934 width=8) (actual
time=180858.448..180858.448 rows=119994608 loops=3)
Buckets: 33554432
Batches: 8 Memory Usage: 847911kB

Overall, this doesn't look like a problem of partition-wise join patch
itself.

[1]: /messages/by-id/CAEepm=3=NHHko3oOzpik+ggLy17AO+px3rGYrg3x_x05+Br9-A@mail.gmail.com
/messages/by-id/CAEepm=3=NHHko3oOzpik+ggLy17AO+px3rGYrg3x_x05+Br9-A@mail.gmail.com

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

Q21_SE_patch.outapplication/octet-stream; name=Q21_SE_patch.out
#151Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#149)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Jul 20, 2017 at 2:44 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Jul 20, 2017 at 11:46 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/07/20 15:05, Ashutosh Bapat wrote:

On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Partition information:
Type of partitioning - single column range partition
Tables partitioned - Lineitem and orders

Lineitem -
Partition key = l_orderkey
No of partitions = 18

Orders -
Partition key = o_orderkey
No of partitions = 11

The patch set upto 0015 would refuse to join two partitioned relations
using a partition-wise join if they have different number of
partitions. Next patches implement a more advanced partition matching
algorithm only for list partitions. Those next patches would refuse to
apply partition-wise join for range partitioned tables. So, I am
confused as to how come partition-wise join is being chosen even when
the number of partitions differ.

In 21_part_patched.out, I see that lineitem is partitionwise-joined with
itself.

Append

-> Hash Semi Join
Hash Cond: (l1.l_orderkey = l2.l_orderkey)
Join Filter: (l2.l_suppkey <> l1.l_suppkey)
Rows Removed by Join Filter: 395116

-> Parallel Seq Scan on lineitem_001 l1
Filter: (l_receiptdate > l_commitdate)
Rows Removed by Filter: 919654

-> Hash
Buckets: 8388608 Batches: 1 Memory Usage: 358464kB
-> Seq Scan on lineitem_001 l2

Ah, I see now.

We need the same number of partitions in all partitioned tables, for
joins to pick up partition-wise join.

Oh, I missed this limitation, will modify my setup to have same number
of partitions in the partitioned table with same ranges. So, does this
also mean that a partitioned table will not join with an unpartitioned
table without append of partitions?

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#152Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#150)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Jul 21, 2017 at 11:42 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Following the discussion at [1], with the patch Thomas posted there, now Q21
completes in some 160 seconds.

Your earlier reports mentioned unpartitioned case taking 300 seconds,
partitioned case without partition-wise join taking 600 seconds and
with partition-wise join it taking 3200 seconds. My experiements
showed that those have changed to 70s, 160s and 160s resp. This is
with Thomas's patch. Can you please confirm?

The plan is changed for the good but does not
use partition-wise join.

As explained earlier, this is because the tables are not partitioned
similarly. Please try with lineitem and orders partitioned similarly
i.e. same number of partitions and exactly same ranges.

Not just the join orders but the join strategy itself changed, with the
patch no hash semi join is picked which was consuming most time there,
rather nested loop semi join is in picture now, though the estimates are
still way-off, but the change in join-order made them terrible from
horrible. It appears like this query is performing efficient now
particularly because of worse under-estimated hash-join as compared to
under-estimated nested loop join.

Earlier it was using partition-wise join between lineitems (l1, l2,
l3) since it's the same table. Now for some reason the planner doesn't
find joining them to each other a better strategy, instead they are
joined indirectly so we don't see partition-wise join being picked. We
should experiment with orders and lineitems being partitioned
similarly. Can you please provide that result?

For the hash-semi-join:
-> Hash (cost=3449457.34..3449457.34 rows=119994934 width=8) (actual
time=180858.448..180858.448 rows=119994608 loops=3)
Buckets: 33554432
Batches: 8 Memory Usage: 847911kB

Overall, this doesn't look like a problem of partition-wise join patch
itself.

Thanks for confirming it.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#153Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#151)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Jul 21, 2017 at 11:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

So, does this
also mean that a partitioned table will not join with an unpartitioned
table without append of partitions?

Yes. When you join an unpartitioned table with a partitioned table,
the planner will choose to append all the partitions of the
partitioned table and then join with the unpartitioned table.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#154Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#153)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Jul 21, 2017 at 12:11 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Jul 21, 2017 at 11:54 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

So, does this
also mean that a partitioned table will not join with an unpartitioned
table without append of partitions?

Yes. When you join an unpartitioned table with a partitioned table,
the planner will choose to append all the partitions of the
partitioned table and then join with the unpartitioned table.

I tested this set of patches for TPC-H benchmark and came across following
results,
- total 7 queries were using partition-wise join,

- Q4 attains a speedup of around 80% compared to the partitioned setup
without partition-wise join, the main reason being the poor plan choice at
head for partitioned database.
When I tried this query with forced nested-loop join then it completes in
some 45 seconds at head. So, basically when no partition-wise join is
present because of terrible selectivity estimation optimiser picks up a
hash join plan, which results poorly as the estimated number of rows are
two orders of magnitude lesser than actual.
Note that this is not the effect of [1]/messages/by-id/CAEepm=3%25 3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/, I tried this without that patch as
well.

- other queries show a good 20-30% improvement in performance. Performance
numbers are as follows,

Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
3 | 76 |127 | 88 |
4 |17 | 244 | 41 |
5 | 52 | 123 | 84 |
7 | 73 | 134 | 103 |
10 | 67 | 111 | 89 |
12 | 53 | 114 | 99 |
18 | 447 | 709 | 551 |

The experimental settings used were,

Partitioning: Range partitioning on lineitem and orders on l_orderkey and
o_orderkey respectively. The number and range of partitions were kept same
for both the tables.

Server parameters:
work_mem - 1GB
effective_cache_size - 8GB
shared_buffers - 8GB
enable_partition_wise_join = on

TPC-H setup:
scale-factor - 20

Commit id - 42171e2cd23c8307bbe0ec64e901f58e297db1c3, also, the patch at
[1]: /messages/by-id/CAEepm=3%25 3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/
Query plans for the above mentioned queries is attached.

[1]: /messages/by-id/CAEepm=3%25 3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com -- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/
3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com
--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

part_perf.zipapplication/zip; name=part_perf.zip
#155Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Rafia Sabih (#154)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

- other queries show a good 20-30% improvement in performance. Performance
numbers are as follows,

Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
3 | 76 |127 | 88 |
4 |17 | 244 | 41 |
5 | 52 | 123 | 84 |
7 | 73 | 134 | 103 |
10 | 67 | 111 | 89 |
12 | 53 | 114 | 99 |
18 | 447 | 709 | 551 |

Hmm. This certainly shows that benefit of the patch, although it's
rather sad that we're still slower than if we hadn't partitioned the
data in the first place. Why does partitioning hurt performance so
much?

Maybe things would be better at a higher scale factor.

When reporting results of this sort, it would be good to make a habit
of reporting the number of partitions along with the other details you
included.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#156Dilip Kumar
Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#155)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

- other queries show a good 20-30% improvement in performance. Performance
numbers are as follows,

Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
3 | 76 |127 | 88 |
4 |17 | 244 | 41 |
5 | 52 | 123 | 84 |
7 | 73 | 134 | 103 |
10 | 67 | 111 | 89 |
12 | 53 | 114 | 99 |
18 | 447 | 709 | 551 |

Hmm. This certainly shows that benefit of the patch, although it's
rather sad that we're still slower than if we hadn't partitioned the
data in the first place. Why does partitioning hurt performance so
much?

I was analysing some of the plans (without partition and with
partition), Seems like one of the reasons of performing worse with the
partitioned table is that we can not use an index on the partitioned
table.

Q4 is taking 17s without partition whereas it's taking 244s with partition.

Now if we analyze the plan

Without partition, it can use parameterize index scan on lineitem
table which is really big in size. But with partition, it has to scan
this table completely.

-> Nested Loop Semi Join
-> Parallel Bitmap Heap Scan on orders
-> Bitmap Index Scan on
idx_orders_orderdate (cost=0.00..24378.88 r
-> Index Scan using idx_lineitem_orderkey on
lineitem (cost=0.57..29.29 rows=105 width=8) (actual
time=0.031..0.031 rows=1 loops=1122364)
Index Cond: (l_orderkey =
orders.o_orderkey)
Filter: (l_commitdate < l_receiptdate)
Rows Removed by Filter: 1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#157Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Dilip Kumar (#156)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Jul 25, 2017 at 9:39 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

- other queries show a good 20-30% improvement in performance. Performance
numbers are as follows,

Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
3 | 76 |127 | 88 |
4 |17 | 244 | 41 |
5 | 52 | 123 | 84 |
7 | 73 | 134 | 103 |
10 | 67 | 111 | 89 |
12 | 53 | 114 | 99 |
18 | 447 | 709 | 551 |

Hmm. This certainly shows that benefit of the patch, although it's
rather sad that we're still slower than if we hadn't partitioned the
data in the first place. Why does partitioning hurt performance so
much?

I was analysing some of the plans (without partition and with
partition), Seems like one of the reasons of performing worse with the
partitioned table is that we can not use an index on the partitioned
table.

Q4 is taking 17s without partition whereas it's taking 244s with partition.

Now if we analyze the plan

Without partition, it can use parameterize index scan on lineitem
table which is really big in size. But with partition, it has to scan
this table completely.

-> Nested Loop Semi Join
-> Parallel Bitmap Heap Scan on orders
-> Bitmap Index Scan on
idx_orders_orderdate (cost=0.00..24378.88 r
-> Index Scan using idx_lineitem_orderkey on
lineitem (cost=0.57..29.29 rows=105 width=8) (actual
time=0.031..0.031 rows=1 loops=1122364)
Index Cond: (l_orderkey =
orders.o_orderkey)
Filter: (l_commitdate < l_receiptdate)
Rows Removed by Filter: 1

If the partitions have the same indexes as the unpartitioned table,
planner manages to create parameterized plans for each partition and
thus parameterized plan for the whole partitioned table. Do we have
same indexes on unpartitioned table and each of the partitions? The
difference between the two cases is the parameterized path on an
unpartitioned table scans only one index whereas that on the
partitioned table scans the indexes on all the partitions. My guess is
the planner thinks those many scans are costlier than hash/merge join
and chooses those strategies over parameterized nest loop join. In
case of partition-wise join, only one index on the inner partition is
involved and thus partition-wise join picks up parameterized nest loop
join. Notice, that this index is much smaller than the index on the
partitioned table, so the index scan will be a bit faster. But only a
bit, since the depth of the index doesn't increase linearly with the
size of index.

Rrun-time partition pruning will improve performance even without
partition-wise join since partition pruning will be able to eliminate
all but one partition and only one index needs to be scanned. If
planner is smart enough to cost that effectively, it will choose
parameterized nest loop join for partitioned table thus improving the
performance similar to unpartitioned case.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#158Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#154)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Query plans for the above mentioned queries is attached.

Can you please share plans for all the queries, even if they haven't
chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in
explain analyze output using -a or -e flats to psql. That way we will
have query and its plan in the same file for ready reference.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#159Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#158)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Query plans for the above mentioned queries is attached.

Can you please share plans for all the queries, even if they haven't
chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in
explain analyze output using -a or -e flats to psql. That way we will
have query and its plan in the same file for ready reference.

I didn't run the query not using partition-wise join, for now.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

#160Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#159)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Query plans for the above mentioned queries is attached.

Can you please share plans for all the queries, even if they haven't
chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in
explain analyze output using -a or -e flats to psql. That way we will
have query and its plan in the same file for ready reference.

I didn't run the query not using partition-wise join, for now.

parse-parse error, sorry. Do you mean, you haven't run the queries
which do not use partition-wise join?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#161Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#160)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 26, 2017 at 11:06 AM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Query plans for the above mentioned queries is attached.

Can you please share plans for all the queries, even if they haven't
chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in
explain analyze output using -a or -e flats to psql. That way we will
have query and its plan in the same file for ready reference.

I didn't run the query not using partition-wise join, for now.

parse-parse error, sorry. Do you mean, you haven't run the queries
which do not use partition-wise join?

Yes, that's what I mean.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

#162Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#161)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 26, 2017 at 11:08 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Wed, Jul 26, 2017 at 11:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Jul 26, 2017 at 11:00 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Wed, Jul 26, 2017 at 10:58 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Jul 25, 2017 at 11:01 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Query plans for the above mentioned queries is attached.

Can you please share plans for all the queries, even if they haven't
chosen partition-wise join when run on partitioned tables with
enable_partition_wise_join ON? Also, please include the query in
explain analyze output using -a or -e flats to psql. That way we will
have query and its plan in the same file for ready reference.

I didn't run the query not using partition-wise join, for now.

parse-parse error, sorry. Do you mean, you haven't run the queries
which do not use partition-wise join?

Yes, that's what I mean.

Ok. If those queries have equi-join between partitioned tables and are
not picking up partition-wise join, that case needs to be
investigated. Q21 for example has join between three lineitem
instances. Those joins can be executed by partition-wise join. But it
may so happen that optimal join order doesn't join partitioned tables
with each other, thus interleaving partitioned tables with
unpartitioned or differently partitioned tables in join order.
Partition-wise join is not possible then. A different partitioning
scheme may be required there.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#163Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#157)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 26, 2017 at 10:38 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Jul 25, 2017 at 9:39 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jul 25, 2017 at 8:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jul 25, 2017 at 1:31 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

- other queries show a good 20-30% improvement in performance. Performance
numbers are as follows,

Query| un_part_head (seconds) | part_head (seconds) | part_patch (seconds) |
3 | 76 |127 | 88 |
4 |17 | 244 | 41 |
5 | 52 | 123 | 84 |
7 | 73 | 134 | 103 |
10 | 67 | 111 | 89 |
12 | 53 | 114 | 99 |
18 | 447 | 709 | 551 |

Hmm. This certainly shows that benefit of the patch, although it's
rather sad that we're still slower than if we hadn't partitioned the
data in the first place. Why does partitioning hurt performance so
much?

I was analysing some of the plans (without partition and with
partition), Seems like one of the reasons of performing worse with the
partitioned table is that we can not use an index on the partitioned
table.

Q4 is taking 17s without partition whereas it's taking 244s with partition.

Now if we analyze the plan

Without partition, it can use parameterize index scan on lineitem
table which is really big in size. But with partition, it has to scan
this table completely.

-> Nested Loop Semi Join
-> Parallel Bitmap Heap Scan on orders
-> Bitmap Index Scan on
idx_orders_orderdate (cost=0.00..24378.88 r
-> Index Scan using idx_lineitem_orderkey on
lineitem (cost=0.57..29.29 rows=105 width=8) (actual
time=0.031..0.031 rows=1 loops=1122364)
Index Cond: (l_orderkey =
orders.o_orderkey)
Filter: (l_commitdate < l_receiptdate)
Rows Removed by Filter: 1

If the partitions have the same indexes as the unpartitioned table,
planner manages to create parameterized plans for each partition and
thus parameterized plan for the whole partitioned table. Do we have
same indexes on unpartitioned table and each of the partitions? The

Yes both lineitem and orders have same number of partitions viz 17 and
on the same partitioning key (*_orderkey) and same ranges for each
partition. However, I missed creating the index on o_orderdate for the
partitions. But on creating it as well, the plan with bitmap heap scan
is used and it still completes in some 200 seconds, check the attached
file for the query plan.

difference between the two cases is the parameterized path on an
unpartitioned table scans only one index whereas that on the
partitioned table scans the indexes on all the partitions. My guess is
the planner thinks those many scans are costlier than hash/merge join
and chooses those strategies over parameterized nest loop join. In
case of partition-wise join, only one index on the inner partition is
involved and thus partition-wise join picks up parameterized nest loop
join. Notice, that this index is much smaller than the index on the
partitioned table, so the index scan will be a bit faster. But only a
bit, since the depth of the index doesn't increase linearly with the
size of index.

As I have observed, the thing with this query is that selectivity
estimation is too high than actual, now when index scan is chosen for
lineitem being in the inner side of NLJ, the query completes quickly
since the number of actual returned rows is too low. However, in case
we pick seq scan, or lineitem is on the outer side, the query is going
to take a really long time. Now, when Hash-Join is picked in the case
of partitioned database and no partition-wise join is available, seq
scan is preferred instead of index scan and hence the elongated query
execution time.

I tried this query with random_page_cost = 0 and forcing NLJ and the
chosen plan completes the query in 45 seconds, check the attached file
for explain analyse output.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

q4_idx_orderdate.outapplication/octet-stream; name=q4_idx_orderdate.out
Q4_low_random_page_cost.outapplication/octet-stream; name=Q4_low_random_page_cost.out
#164Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#162)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jul 26, 2017 at 12:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Ok. If those queries have equi-join between partitioned tables and are
not picking up partition-wise join, that case needs to be
investigated. Q21 for example has join between three lineitem
instances. Those joins can be executed by partition-wise join. But it
may so happen that optimal join order doesn't join partitioned tables
with each other, thus interleaving partitioned tables with
unpartitioned or differently partitioned tables in join order.
Partition-wise join is not possible then. A different partitioning
scheme may be required there.

Good point, will look into this direction as well.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#165Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#140)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's revised patch set with only 0004 revised. That patch deals with
creating multi-level inheritance hierarchy from multi-level partition
hierarchy. The original logic of recursively calling
inheritance_planner()'s guts over the inheritance hierarchy required
that for every such recursion we flatten many lists created by that
code. Recursion also meant that root->append_rel_list is traversed as
many times as the number of partitioned partitions in the hierarchy.
Instead the revised version keep the iterative shape of
inheritance_planner() intact, thus naturally creating flat lists,
iterates over root->append_rel_list only once and is still easy to
read and maintain.

0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004:

+        if (brel->reloptkind != RELOPT_BASEREL &&
+            brte->relkind != RELKIND_PARTITIONED_TABLE)

I spent a lot of time staring at this code before I figured out what
was going on here. We're iterating over simple_rel_array, so the
reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
But does that guarantee that rtekind is RTE_RELATION such that
brte->relkind will be initialized to a value? I'm not 100% sure. I
think it would be clearer to write this test like this:

Assert(IS_SIMPLE_REL(brel));
if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
(brte->rtekind != RELOPT_BASEREL ||
brte->relkind != RELKIND_PARTITIONED_TABLE))
continue;

Note that the way you wrote the comment is says if it *is* another
REL, not if it's *not* a baserel; it's good if those kinds of little
details match between the code and the comments.

It is not clear to me what the motivation is for the API changes in
expanded_inherited_rtentry. They don't appear to be necessary. If
they are necessary, you need to do a more thorough job updating the
comments. This one, in particular:

* If so, add entries for all the child tables to the query's
* rangetable, and build AppendRelInfo nodes for all the child tables
* and add them to root->append_rel_list. If not, clear the entry's

And the comments could maybe say something like "We return the list of
appinfos rather than directly appending it to append_rel_list because
$REASON."

-         * is a partitioned table.
+         * RTE simply duplicates the parent *partitioned* table.
          */
-        if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
+        if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)

This is another case where it's hard to understand the test from the comments.

+     * In case of multi-level inheritance hierarchy, for every child we require
+     * PlannerInfo of its immediate parent. Hence we save those in a an array

Say why. Also, need to fix "a an".

I'm a little bit surprised that this patch doesn't make any changes to
allpaths.c or relnode.c. It looks to me like we'll generate paths for
the new RTEs that are being added. Are we going to end up with
multiple levels of Append nodes, then? Does the consider the way
consider_parallel is propagated up and down in set_append_rel_size()
and set_append_rel_pathlist() really work with multiple levels? Maybe
this is all fine; I haven't tried to verify it in depth.

Overall I think this is a reasonable direction to go but I'm worried
that there may be bugs lurking -- other code that needs adjusting that
hasn't been found, really.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#166Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#165)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's revised patch set with only 0004 revised. That patch deals with
creating multi-level inheritance hierarchy from multi-level partition
hierarchy. The original logic of recursively calling
inheritance_planner()'s guts over the inheritance hierarchy required
that for every such recursion we flatten many lists created by that
code. Recursion also meant that root->append_rel_list is traversed as
many times as the number of partitioned partitions in the hierarchy.
Instead the revised version keep the iterative shape of
inheritance_planner() intact, thus naturally creating flat lists,
iterates over root->append_rel_list only once and is still easy to
read and maintain.

0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004:

+        if (brel->reloptkind != RELOPT_BASEREL &&
+            brte->relkind != RELKIND_PARTITIONED_TABLE)

I spent a lot of time staring at this code before I figured out what
was going on here. We're iterating over simple_rel_array, so the
reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
But does that guarantee that rtekind is RTE_RELATION such that
brte->relkind will be initialized to a value? I'm not 100% sure.

Comment in RangeTblEntry says
952 /*
953 * Fields valid for a plain relation RTE (else zero):
954 *
... clipped portion for RTE_NAMEDTUPLESTORE related comment

960 Oid relid; /* OID of the relation */
961 char relkind; /* relation kind (see pg_class.relkind) */

This means that relkind will be 0 when rtekind != RTE_RELATION. So,
the condition holds. But code creating an RTE somewhere which is not
in sync with this comment would create a problem. So your suggestion
makes sense.

I
think it would be clearer to write this test like this:

Assert(IS_SIMPLE_REL(brel));
if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
(brte->rtekind != RELOPT_BASEREL ||

Do you mean (brte_>rtekind != RTE_RELATION)?

brte->relkind != RELKIND_PARTITIONED_TABLE))
continue;

Note that the way you wrote the comment is says if it *is* another
REL, not if it's *not* a baserel; it's good if those kinds of little
details match between the code and the comments.

I find the existing comment and code in this part of the function
differ. The comment just above the loop on simple_rel_array[], talks
about changing something in the child, but the very next line skips
child relations and later a loop on append_rel_list makes changes to
appropriate children. I guess, it's done that way to keep the code
working even after we introduce some RELOPTKIND other than BASEREL or
OTHER_MEMBER_REL for a simple rel. But your suggestion makes more
sense. Changed it according to your suggestion.

It is not clear to me what the motivation is for the API changes in
expanded_inherited_rtentry. They don't appear to be necessary.

expand_inherited_rtentry() creates AppendRelInfos for all the children
of a given parent and collects them in a list. The list is appended to
root->append_rel_list at the end of the function. Now that function
needs to do this recursively. This means that for a partitioned
partition table its children's AppendRelInfos will be added to
root->append_rel_list before AppendRelInfo of that partitioned
partition table. inheritance_planner() assumes that the parent's
AppendRelInfo comes before its children in append_rel_list.This
assumption allows it to be simplified greately, retaining its
iterative form. My earlier patches had recursive version of
inheritance_planner(), which is complex. I have comments in this patch
explaining this.

Adding AppendRelInfos to root->append_rel_list as and when they are
created would keep parent AppendRelInfos before those of children. But
that function throws away the AppendRelInfo it created when their are
no real children i.e. in partitioned table's case when has no leaf
partitions. So, we can't do that. Hence, I chose to change the API to
return the list of AppendRelInfos when the given RTE has real
children.

If
they are necessary, you need to do a more thorough job updating the
comments. This one, in particular:

* If so, add entries for all the child tables to the query's
* rangetable, and build AppendRelInfo nodes for all the child tables
* and add them to root->append_rel_list. If not, clear the entry's

Done.

And the comments could maybe say something like "We return the list of
appinfos rather than directly appending it to append_rel_list because
$REASON."

Done. Please check the attached version.

-         * is a partitioned table.
+         * RTE simply duplicates the parent *partitioned* table.
*/
-        if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
+        if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)

This is another case where it's hard to understand the test from the comments.

The current comment says it all, but it very cryptic manner.
1526 /*
1527 * Build an AppendRelInfo for this parent and child,
unless the child
1528 * RTE simply duplicates the parent *partitioned* table.
1529 */

The comment makes sense in the context of this paragraph in the prologue
1364 * Note that the original RTE is considered to represent the whole
1365 * inheritance set. The first of the generated RTEs is an RTE for the same
1366 * table, but with inh = false, to represent the parent table in its role
1367 * as a simple member of the inheritance set.
1368 *

The code avoids creating AppendRelInfos for a child which represents
the parent in its role as a simple member of inheritance set.

I have reworded it as
1526 /*
1527 * Build an AppendRelInfo for this parent and child,
unless the child
1528 * RTE represents the parent as a simple member of inheritance set.
1529 */

+     * In case of multi-level inheritance hierarchy, for every child we require
+     * PlannerInfo of its immediate parent. Hence we save those in a an array

Say why. Also, need to fix "a an".

Done.

I'm a little bit surprised that this patch doesn't make any changes to
allpaths.c or relnode.c.

It looks to me like we'll generate paths for
the new RTEs that are being added. Are we going to end up with
multiple levels of Append nodes, then? Does the consider the way
consider_parallel is propagated up and down in set_append_rel_size()
and set_append_rel_pathlist() really work with multiple levels? Maybe
this is all fine; I haven't tried to verify it in depth.

This has been discussed before, but I can not locate the mail
answering these questions. accumulate_append_subpath() called from
add_paths_to_append_rel() takes care of flattening Merge/Append paths.
The planner code deals with the multi-level inheritance hierarchy
created for subqueries with set operations. There is code in relnode.c
to build the RelOptInfos for such subqueries recursively through using
RangeTblEntry::inh flag. So there are no changes in allpaths.c and
relnode.c. Are you looking for some other changes?

Overall I think this is a reasonable direction to go but I'm worried
that there may be bugs lurking -- other code that needs adjusting that
hasn't been found, really.

Planner code is already aware of such hierarchies except DMLs, which
this patch adjusts. We have fixed issues revealed by mine and
Rajkumar's testing.
What kinds of things you suspect?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#167Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#166)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Forgot the patch set. Here it is.

On Mon, Jul 31, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Jul 31, 2017 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 14, 2017 at 3:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's revised patch set with only 0004 revised. That patch deals with
creating multi-level inheritance hierarchy from multi-level partition
hierarchy. The original logic of recursively calling
inheritance_planner()'s guts over the inheritance hierarchy required
that for every such recursion we flatten many lists created by that
code. Recursion also meant that root->append_rel_list is traversed as
many times as the number of partitioned partitions in the hierarchy.
Instead the revised version keep the iterative shape of
inheritance_planner() intact, thus naturally creating flat lists,
iterates over root->append_rel_list only once and is still easy to
read and maintain.

0001-0003 look basically OK to me, modulo some cosmetic stuff. Regarding 0004:

+        if (brel->reloptkind != RELOPT_BASEREL &&
+            brte->relkind != RELKIND_PARTITIONED_TABLE)

I spent a lot of time staring at this code before I figured out what
was going on here. We're iterating over simple_rel_array, so the
reloptkind must be RELOPT_OTHER_MEMBER_REL if it isn't RELOPT_BASEREL.
But does that guarantee that rtekind is RTE_RELATION such that
brte->relkind will be initialized to a value? I'm not 100% sure.

Comment in RangeTblEntry says
952 /*
953 * Fields valid for a plain relation RTE (else zero):
954 *
... clipped portion for RTE_NAMEDTUPLESTORE related comment

960 Oid relid; /* OID of the relation */
961 char relkind; /* relation kind (see pg_class.relkind) */

This means that relkind will be 0 when rtekind != RTE_RELATION. So,
the condition holds. But code creating an RTE somewhere which is not
in sync with this comment would create a problem. So your suggestion
makes sense.

I
think it would be clearer to write this test like this:

Assert(IS_SIMPLE_REL(brel));
if (brel->reloptkind == RELOPT_OTHER_MEMBER_REL &&
(brte->rtekind != RELOPT_BASEREL ||

Do you mean (brte_>rtekind != RTE_RELATION)?

brte->relkind != RELKIND_PARTITIONED_TABLE))
continue;

Note that the way you wrote the comment is says if it *is* another
REL, not if it's *not* a baserel; it's good if those kinds of little
details match between the code and the comments.

I find the existing comment and code in this part of the function
differ. The comment just above the loop on simple_rel_array[], talks
about changing something in the child, but the very next line skips
child relations and later a loop on append_rel_list makes changes to
appropriate children. I guess, it's done that way to keep the code
working even after we introduce some RELOPTKIND other than BASEREL or
OTHER_MEMBER_REL for a simple rel. But your suggestion makes more
sense. Changed it according to your suggestion.

It is not clear to me what the motivation is for the API changes in
expanded_inherited_rtentry. They don't appear to be necessary.

expand_inherited_rtentry() creates AppendRelInfos for all the children
of a given parent and collects them in a list. The list is appended to
root->append_rel_list at the end of the function. Now that function
needs to do this recursively. This means that for a partitioned
partition table its children's AppendRelInfos will be added to
root->append_rel_list before AppendRelInfo of that partitioned
partition table. inheritance_planner() assumes that the parent's
AppendRelInfo comes before its children in append_rel_list.This
assumption allows it to be simplified greately, retaining its
iterative form. My earlier patches had recursive version of
inheritance_planner(), which is complex. I have comments in this patch
explaining this.

Adding AppendRelInfos to root->append_rel_list as and when they are
created would keep parent AppendRelInfos before those of children. But
that function throws away the AppendRelInfo it created when their are
no real children i.e. in partitioned table's case when has no leaf
partitions. So, we can't do that. Hence, I chose to change the API to
return the list of AppendRelInfos when the given RTE has real
children.

If
they are necessary, you need to do a more thorough job updating the
comments. This one, in particular:

* If so, add entries for all the child tables to the query's
* rangetable, and build AppendRelInfo nodes for all the child tables
* and add them to root->append_rel_list. If not, clear the entry's

Done.

And the comments could maybe say something like "We return the list of
appinfos rather than directly appending it to append_rel_list because
$REASON."

Done. Please check the attached version.

-         * is a partitioned table.
+         * RTE simply duplicates the parent *partitioned* table.
*/
-        if (childrte->relkind != RELKIND_PARTITIONED_TABLE)
+        if (childrte->relkind != RELKIND_PARTITIONED_TABLE || childrte->inh)

This is another case where it's hard to understand the test from the comments.

The current comment says it all, but it very cryptic manner.
1526 /*
1527 * Build an AppendRelInfo for this parent and child,
unless the child
1528 * RTE simply duplicates the parent *partitioned* table.
1529 */

The comment makes sense in the context of this paragraph in the prologue
1364 * Note that the original RTE is considered to represent the whole
1365 * inheritance set. The first of the generated RTEs is an RTE for the same
1366 * table, but with inh = false, to represent the parent table in its role
1367 * as a simple member of the inheritance set.
1368 *

The code avoids creating AppendRelInfos for a child which represents
the parent in its role as a simple member of inheritance set.

I have reworded it as
1526 /*
1527 * Build an AppendRelInfo for this parent and child,
unless the child
1528 * RTE represents the parent as a simple member of inheritance set.
1529 */

+     * In case of multi-level inheritance hierarchy, for every child we require
+     * PlannerInfo of its immediate parent. Hence we save those in a an array

Say why. Also, need to fix "a an".

Done.

I'm a little bit surprised that this patch doesn't make any changes to
allpaths.c or relnode.c.

It looks to me like we'll generate paths for
the new RTEs that are being added. Are we going to end up with
multiple levels of Append nodes, then? Does the consider the way
consider_parallel is propagated up and down in set_append_rel_size()
and set_append_rel_pathlist() really work with multiple levels? Maybe
this is all fine; I haven't tried to verify it in depth.

This has been discussed before, but I can not locate the mail
answering these questions. accumulate_append_subpath() called from
add_paths_to_append_rel() takes care of flattening Merge/Append paths.
The planner code deals with the multi-level inheritance hierarchy
created for subqueries with set operations. There is code in relnode.c
to build the RelOptInfos for such subqueries recursively through using
RangeTblEntry::inh flag. So there are no changes in allpaths.c and
relnode.c. Are you looking for some other changes?

Overall I think this is a reasonable direction to go but I'm worried
that there may be bugs lurking -- other code that needs adjusting that
hasn't been found, really.

Planner code is already aware of such hierarchies except DMLs, which
this patch adjusts. We have fixed issues revealed by mine and
Rajkumar's testing.
What kinds of things you suspect?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v23.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v23.tar.gz
#168Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#166)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Adding AppendRelInfos to root->append_rel_list as and when they are
created would keep parent AppendRelInfos before those of children. But
that function throws away the AppendRelInfo it created when their are
no real children i.e. in partitioned table's case when has no leaf
partitions. So, we can't do that. Hence, I chose to change the API to
return the list of AppendRelInfos when the given RTE has real
children.

So, IIUC, the case you're concerned about is when you have a hierarchy
of only partitioned tables, with no plain tables. For example, if B
is a partitioned table and a partition of A, and that's all there is,
A will recurse to B and B will return NIL.

Is it necessary to get rid of the extra AppendRelInfos, or are they
harmless like the duplicate RTE and PlanRowMark nodes?

/*
* If all the children were temp tables or a partitioned parent did not
* have any leaf partitions, pretend it's a non-inheritance situation; we
* don't need Append node in that case. The duplicate RTE we added for
* the parent table is harmless, so we don't bother to get rid of it;
* ditto for the useless PlanRowMark node.
*/
if (!need_append)
{
/* Clear flag before returning */
rte->inh = false;
return;
}

If we do need to get rid of the extra AppendRelInfos, maybe a less
invasive solution would be to change the if (!need_append) case to do
root->append_rel_list = list_truncate(root->append_rel_list,
original_append_rel_length).

The code avoids creating AppendRelInfos for a child which represents
the parent in its role as a simple member of inheritance set.

OK, I suggest we rewrite the whole comment like this: "We need an
AppendRelInfo if paths will be built for the child RTE. If
childrte->inh is true, then we'll always need to generate append paths
for it. If childrte->inh is false, we must scan it if it's not a
partitioned table; but if it is a partitioned table, then it never has
any data of its own and need not be scanned. It does, however, need
to be locked, so note the OID for inclusion in the
PartitionedChildRelInfo we're going to build."

It looks like you also need to update the header comment for
AppendRelInfo itself, in nodes/relation.h.

+ * PlannerInfo for every child is obtained by translating relevant members

Insert "The" at the start of the sentence.

-        subroot->parse = (Query *)
-            adjust_appendrel_attrs(root,
-                                   (Node *) parse,
-                                   appinfo);
+        subroot->parse = (Query *) adjust_appendrel_attrs(parent_root,
+                                                          (Node *)
parent_parse,
+                                                          1, &appinfo);

I suggest that you don't remove the line break after the cast.

+         * If the child is further partitioned, remember it as a parent. Since
+         * partitioned tables do not have any data, we don't need to create
+         * plan for it. We just need its PlannerInfo set up to be used while
+         * planning its children.

Most of this comment is in the singular, but the first half of the
second sentence is plural. Should be "Since a partitioned table does
not have any data...". I might replace the last sentence by "We do,
however, need to remember the PlannerInfo for use when planning its
children."

+-- Check UPDATE with *multi-level partitioned* inherited target

Asterisks seem like overkill.

Since expand_inherited_rtentry() and set_append_rel_size() can now
recurse down to as many levels as there are levels in the inheritance
hierarchy, they should probably have a check_stack_depth() check.

Overall I think this is a reasonable direction to go but I'm worried
that there may be bugs lurking -- other code that needs adjusting that
hasn't been found, really.

Planner code is already aware of such hierarchies except DMLs, which
this patch adjusts. We have fixed issues revealed by mine and
Rajkumar's testing.
What kinds of things you suspect?

I'm not sure exactly. It's just hard with this kind of patch to make
sure you've caught everything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#169Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#167)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Jul 31, 2017 at 9:07 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Forgot the patch set. Here it is.

The commit message for 0005 isn't really accurate given that it
follows 0004. I think you could just flatten 0005 and 0006 into one
patch.

Reviewing those together:

- Existing code does partdesc = RelationGetPartitionDesc(relation) but
this has got it as part_desc. Seems better to be consistent.
Likewise existing variables for PartitionKey are key or partkey, not
part_key.

- get_relation_partition_info has a useless trailing return.

- Instead of adding nparts, boundinfo, and part_oids to RelOptInfo,
how about just adding partdesc? Seems cleaner.

- pkexprs seems like a potentially confusing name, since PK is widely
used to mean "primary key" but here you mean "partition key". Maybe
partkeyexprs.

- build_simple_rel's matching algorithm is O(n^2). We may have talked
about this problem before...

- This patch introduces some bits that are not yet used, like
nullable_pkexprs, or even the code to set the partition scheme for
joinrels. I think perhaps some of that logic should be moved from
0008 to here - e.g. the initial portion of
build_joinrel_partition_info.

There may be more, but I've run out of energy for tonight.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#170Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#168)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 3, 2017 at 2:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Adding AppendRelInfos to root->append_rel_list as and when they are
created would keep parent AppendRelInfos before those of children. But
that function throws away the AppendRelInfo it created when their are
no real children i.e. in partitioned table's case when has no leaf
partitions. So, we can't do that. Hence, I chose to change the API to
return the list of AppendRelInfos when the given RTE has real
children.

So, IIUC, the case you're concerned about is when you have a hierarchy
of only partitioned tables, with no plain tables. For example, if B
is a partitioned table and a partition of A, and that's all there is,
A will recurse to B and B will return NIL.

Is it necessary to get rid of the extra AppendRelInfos, or are they
harmless like the duplicate RTE and PlanRowMark nodes?

Actually there are two sides to this:

If there are no leaf partitions, without the patch two things happen
1. rte->inh is cleared and 2 no appinfo is added to the
root->append_rel_list, even though harmless RTE and PlanRowMark nodes
are created. The first avoids treating the relation as the inheritance
parent and thus avoids creating any child relations and paths, saving
a lot of work. Ultimately set_rel_size() marks such a relation as
dummy
352 else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
353 {
354 /*
355 * A partitioned table without leaf
partitions is marked
356 * as a dummy rel.
357 */
358 set_dummy_rel_pathlist(rel);
359 }

Since root->append_rel_list is traversed for every inheritance parent,
not adding needless AppendRelInfos improves performance and saves
memory, (FWIW or consider a case where there are thousands of
partitioned partitions without any leaf partition.).

My initial thought was to keep both these properties intact. But then
removing such AppendRelInfos would have a problem when such a table is
on the inner side of the join as described in [1]/messages/by-id/CAFjFpRd5+zroxY7UMGTR2M=rjBV4aBOCxQg3+1rBmTPLK5mpDg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company. So I wrote the
patch not to do either of those things when there are partitioned
partitions without leaf partitions. So, it looks like you are correct,
we could just go ahead and add those AppendRelInfos directly to
root->append_rel_list.

/*
* If all the children were temp tables or a partitioned parent did not
* have any leaf partitions, pretend it's a non-inheritance situation; we
* don't need Append node in that case. The duplicate RTE we added for
* the parent table is harmless, so we don't bother to get rid of it;
* ditto for the useless PlanRowMark node.
*/
if (!need_append)
{
/* Clear flag before returning */
rte->inh = false;
return;
}

If we do need to get rid of the extra AppendRelInfos, maybe a less
invasive solution would be to change the if (!need_append) case to do
root->append_rel_list = list_truncate(root->append_rel_list,
original_append_rel_length).

We might require this for non-partitioned tables. I will try to
implement it this way in the next set of patches.

The code avoids creating AppendRelInfos for a child which represents
the parent in its role as a simple member of inheritance set.

OK, I suggest we rewrite the whole comment like this: "We need an
AppendRelInfo if paths will be built for the child RTE. If
childrte->inh is true, then we'll always need to generate append paths
for it. If childrte->inh is false, we must scan it if it's not a
partitioned table; but if it is a partitioned table, then it never has
any data of its own and need not be scanned. It does, however, need
to be locked, so note the OID for inclusion in the
PartitionedChildRelInfo we're going to build."

Done.

It looks like you also need to update the header comment for
AppendRelInfo itself, in nodes/relation.h.

Done. Thanks for pointing it out.

+ * PlannerInfo for every child is obtained by translating relevant members

Insert "The" at the start of the sentence.

Done.

-        subroot->parse = (Query *)
-            adjust_appendrel_attrs(root,
-                                   (Node *) parse,
-                                   appinfo);
+        subroot->parse = (Query *) adjust_appendrel_attrs(parent_root,
+                                                          (Node *)
parent_parse,
+                                                          1, &appinfo);

I suggest that you don't remove the line break after the cast.

This is part of 0001 patch, fixed there.

+         * If the child is further partitioned, remember it as a parent. Since
+         * partitioned tables do not have any data, we don't need to create
+         * plan for it. We just need its PlannerInfo set up to be used while
+         * planning its children.

Most of this comment is in the singular, but the first half of the
second sentence is plural. Should be "Since a partitioned table does
not have any data...". I might replace the last sentence by "We do,
however, need to remember the PlannerInfo for use when planning its
children."

Done.

+-- Check UPDATE with *multi-level partitioned* inherited target

Asterisks seem like overkill.

Done.

This style was copied from an existing comment in that file.
-- Check UPDATE with *partitioned* inherited target

Since expand_inherited_rtentry() and set_append_rel_size() can now
recurse down to as many levels as there are levels in the inheritance
hierarchy, they should probably have a check_stack_depth() check.

Done. Even without this patch set_append_rel_size() could recurse down
many levels of inheritance hierarchy (created by set operation
queries) through
set_append_rel_size()->set_rel_size()->set_append_rel_size(). And so
would set_rel_size(). But now it's more prone to that problem.

I will provide updated patches after taking care of your comments
about 0005 and 0006.

[1]: /messages/by-id/CAFjFpRd5+zroxY7UMGTR2M=rjBV4aBOCxQg3+1rBmTPLK5mpDg@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#171Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#170)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 3, 2017 at 9:38 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Aug 3, 2017 at 2:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jul 31, 2017 at 7:59 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Adding AppendRelInfos to root->append_rel_list as and when they are
created would keep parent AppendRelInfos before those of children. But
that function throws away the AppendRelInfo it created when their are
no real children i.e. in partitioned table's case when has no leaf
partitions. So, we can't do that. Hence, I chose to change the API to
return the list of AppendRelInfos when the given RTE has real
children.

So, IIUC, the case you're concerned about is when you have a hierarchy
of only partitioned tables, with no plain tables. For example, if B
is a partitioned table and a partition of A, and that's all there is,
A will recurse to B and B will return NIL.

Is it necessary to get rid of the extra AppendRelInfos, or are they
harmless like the duplicate RTE and PlanRowMark nodes?

Actually there are two sides to this:

If there are no leaf partitions, without the patch two things happen
1. rte->inh is cleared and 2 no appinfo is added to the
root->append_rel_list, even though harmless RTE and PlanRowMark nodes
are created. The first avoids treating the relation as the inheritance
parent and thus avoids creating any child relations and paths, saving
a lot of work. Ultimately set_rel_size() marks such a relation as
dummy
352 else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
353 {
354 /*
355 * A partitioned table without leaf
partitions is marked
356 * as a dummy rel.
357 */
358 set_dummy_rel_pathlist(rel);
359 }

Since root->append_rel_list is traversed for every inheritance parent,
not adding needless AppendRelInfos improves performance and saves
memory, (FWIW or consider a case where there are thousands of
partitioned partitions without any leaf partition.).

With some testing, I found that this was true once, but not after
declarative partition support. Please check [1]/messages/by-id/CAFjFpReWJr1yTkHU=OqiMBmcYCMoSW3VPR39RBuQ_ovwDFBT5Q@mail.gmail.com.

My initial thought was to keep both these properties intact. But then
removing such AppendRelInfos would have a problem when such a table is
on the inner side of the join as described in [1]. So I wrote the
patch not to do either of those things when there are partitioned
partitions without leaf partitions. So, it looks like you are correct,
we could just go ahead and add those AppendRelInfos directly to
root->append_rel_list.

Irrespective of [1]/messages/by-id/CAFjFpReWJr1yTkHU=OqiMBmcYCMoSW3VPR39RBuQ_ovwDFBT5Q@mail.gmail.com, I have implemented your idea of not changing
signature of expand_inherited_rtentry() with following idea.

If we do need to get rid of the extra AppendRelInfos, maybe a less
invasive solution would be to change the if (!need_append) case to do
root->append_rel_list = list_truncate(root->append_rel_list,
original_append_rel_length).

[1]: /messages/by-id/CAFjFpReWJr1yTkHU=OqiMBmcYCMoSW3VPR39RBuQ_ovwDFBT5Q@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#172Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#169)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 3, 2017 at 7:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jul 31, 2017 at 9:07 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Forgot the patch set. Here it is.

The commit message for 0005 isn't really accurate given that it
follows 0004. I think you could just flatten 0005 and 0006 into one
patch.

Earlier, there was some doubt about the approach for expanding
multi-level partitioned table's inheritance hierarchy. So, I had
separated all multi-level partition related changes into patches by
themselves, collocating them with their respective single level
partition peers. I thought that would make the reviews easier while
leaving the possibility of committing single-level partition-wise
support before multi-level partition-wise join support. From your
previous replies, it seems that you are fine with the multi-level
partitioned hierarchy expansion, so it may be committed along-with
other patches. So, I have squashed those two patches together.
Similarly I have squashed pairs 0008-0009 and 0012-0013. Those dealt
with similar issues for single-level partitioned and multi-level
partitioned tables.

Reviewing those together:

- Existing code does partdesc = RelationGetPartitionDesc(relation) but
this has got it as part_desc. Seems better to be consistent.
Likewise existing variables for PartitionKey are key or partkey, not
part_key.

Done.

- get_relation_partition_info has a useless trailing return.

Done.

- Instead of adding nparts, boundinfo, and part_oids to RelOptInfo,
how about just adding partdesc? Seems cleaner.

nparts and boundinfo apply to any kind of relation simple, join or
upper but part_oids applies only to simple relations. So, I have split
those members and added them in respective sections. Do you still
think that we should add PartitionDesc as a single member?

Similar to your suggestion of changing name of part_key to partkey,
should we rename part_scheme as partscheme, part_rels as partrels and
part_oids as partoids?

- pkexprs seems like a potentially confusing name, since PK is widely
used to mean "primary key" but here you mean "partition key". Maybe
partkeyexprs.

agreed. Done. PartitionKey structure has member partexprs for
partition keys which are expressions. I have used the same name
instread of pkexprs.

- build_simple_rel's matching algorithm is O(n^2). We may have talked
about this problem before...

If root->append_rel_list has AppendRelInfos in the same order as the
partition bounds, we could reduce this to O(n). That expansion option
is being discussed in [1]/messages/by-id/0118a1f2-84bb-19a7-b906-dec040a206f2@lab.ntt.co.jp. Once we commit it, I will change the code
to make it O(n). Right now, we can not rely on the order of
AppendRelInfos in root->append_rel_list.

- This patch introduces some bits that are not yet used, like
nullable_pkexprs,

We could fix that by adding that member in 0008. IIRC, earlier you had
complained about declaring a structure in one patch and adding members
to it in the subsequent patches, so I just added all members in the
same patch. BTW, I have renamed that member to nullable_partexprs to
be consistent with change to pkexpers.

or even the code to set the partition scheme for
joinrels. I think perhaps some of that logic should be moved from
0008 to here - e.g. the initial portion of
build_joinrel_partition_info.

Setting part_scheme for joinrel should really be part of the patch
which actually implements partition-wise join. That will keep all the
partition-wise join implementation together. 0005 and 0006 really just
introduce PartitionScheme for base relation. I think PartitionScheme
and other partitioning properties for base relation are useful for
something else like partition-wise aggregation on base relation. So,
we may want to commit those two patches separately. If you want, we
could squash the partition scheme and partition-wise join
implementation together.

[1]: /messages/by-id/0118a1f2-84bb-19a7-b906-dec040a206f2@lab.ntt.co.jp

Updated patches attached.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v24.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v24.tar.gz
#173Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Ashutosh Bapat (#172)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Aug 8, 2017 at 8:51 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Updated patches attached.

Hi,

I started reviewing this. It's nicely commented, but it's also very
complicated, and it's going to take me several rounds to understand
what all the parts do, but here's some assorted feedback after reading
some parts of the patches, some tinkering and quite a bit of time
spent trying to break it (so far unsuccessfully).

On my computer it took ~1.5 seconds to plan a 1000 partition join,
~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
4000 partition join. I poked around in a profiler a bit and saw that
for the 2000 partition case I spent almost half the time in
create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
and about half of that was in bms_is_subset. The other half the time
was in query_planner->make_one_rel which spent 2/3 of its time in
set_rel_size->add_child_rel_equivalences->bms_overlap and the other
1/3 in standard_join_search.

One micro-optimisation opportunity I noticed was in those
bms_is_subset and bms_overlap calls. The Bitmapsets don't tend to
have trailing words but often have hundreds of empty leading words.
If I hack bitmapset.{c,h} so that it tracks first_non_empty_wordnum
and then adjust bms_is_subset and bms_overlap so that they start their
searches at Min(a->first_non_empty_wordnum,
b->first_non_empty_wordnum) then the planning time improves
measurably:

1000 partitions: ~1.5s -> 1.3s
2000 partitions: ~7.1s -> 5.8s
4000 partitions: ~50s -> ~44s

When using list-based partitions, it must be possible to omit the part
of a join key that is implied by the partition because the partition
has only one list value. For example, if I create a two level
hierarchy with one partition per US state and then time-based range
partitions under that, the state part of this merge condition is
redundant:

Merge Cond: ((sales_wy_2017_10.state =
purchases_wy_2017_10.state) AND (sales_wy_2017_10.created =
purchases_wy_2017_10.created))

0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch

-partition_bounds_equal(PartitionKey key,
+partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
                                           PartitionBoundInfo b1,
PartitionBoundInfo b2)

I wonder is there any value in creating a struct to represent the
common part of PartitionKey and PartitionScheme that functions like
this and others need? Maybe not. Perhaps you didn't want to make
PartitionKey contain a PartitionScheme because then you'd lose the
property that every pointer to PartitionScheme in the system must be a
pointer to an interned (canonical) PartitionScheme, so it's always
safe to compare pointers to test for equality?

0005-Canonical-partition-scheme.patch:

+/*
+ * get_relation_partition_info
+ *
+ * Retrieves partitioning information for a given relation.
+ *
+ * For a partitioned table it sets partitioning scheme, partition key
+ * expressions, number of partitions and OIDs of partitions in the given
+ * RelOptInfo.
+ */
+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+                                                       Relation relation)

Would this be better called "set_relation_partition_info"? It doesn't
really "retrieve" the information, it "installs" it.

+{
+       PartitionDesc part_desc;
+
+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !(rel->part_scheme = find_partition_scheme(root, relation)))
+               return;

Here and elsewhere you use the idiom !(foo = bar), which is perfectly
good C in my book but I understand the project convention is to avoid
implicit pointer->boolean conversion and to prefer expressions like
(foo = bar) != NULL and there is certainly a lot more code like that.

0007-Partition-wise-join-implementation.patch

+ {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,

This GUC should appear in postgresql.conf.sample.

I'm chewing on 0007. More soon.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#174Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#173)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 10, 2017 at 1:39 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On my computer it took ~1.5 seconds to plan a 1000 partition join,
~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
4000 partition join. I poked around in a profiler a bit and saw that
for the 2000 partition case I spent almost half the time in
create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
and about half of that was in bms_is_subset. The other half the time
was in query_planner->make_one_rel which spent 2/3 of its time in
set_rel_size->add_child_rel_equivalences->bms_overlap and the other
1/3 in standard_join_search.

Ashutosh asked me how I did that. Please see attached. I was
explaining simple joins like SELECT * FROM foofoo JOIN barbar USING
(a, b). Here also is the experimental hack I tried when I saw
bitmapset.c eating my CPU.

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

bitmapset-track-leading-empty-space.patchapplication/octet-stream; name=bitmapset-track-leading-empty-space.patch
make-partitions.shapplication/x-sh; name=make-partitions.sh
#175Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Thomas Munro (#174)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 10, 2017 at 9:28 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Aug 10, 2017 at 1:39 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On my computer it took ~1.5 seconds to plan a 1000 partition join,
~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
4000 partition join. I poked around in a profiler a bit and saw that
for the 2000 partition case I spent almost half the time in
create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
and about half of that was in bms_is_subset. The other half the time
was in query_planner->make_one_rel which spent 2/3 of its time in
set_rel_size->add_child_rel_equivalences->bms_overlap and the other
1/3 in standard_join_search.

Ashutosh asked me how I did that. Please see attached. I was
explaining simple joins like SELECT * FROM foofoo JOIN barbar USING
(a, b). Here also is the experimental hack I tried when I saw
bitmapset.c eating my CPU.

On my machine I observed following planning times
1000 partitions, without partition-wise join, 100ms; with
partition-wise join 500ms
2000 partitions, without partition-wise join, 320ms; with
partition-wise join 2.2s
4000 partitions, without partition-wise join, 1.3ms; with
partition-wise join 17s

So, even without partition-wise join the planning time increases at a
superlinear rate with the number of partitions.

Your patch didn't improve planning time without partition-wise join,
so it's something good to have along-with partition-wise join. Given
that Bitmapsets are used in other parts of code as well, the
optimization may affect those parts as well, esp. the overhead of
maintaining first_non_empty_wordnum.

The comment at the beginning of the file bitmapset.c says
3 * bitmapset.c
4 * PostgreSQL generic bitmap set package
5 *
6 * A bitmap set can represent any set of nonnegative integers, although
7 * it is mainly intended for sets where the maximum value is not large,
8 * say at most a few hundred.

When we created thousands of children, we have certainly crossed the
few hundred threashold. So, there may be other optimizations possible
there. Probably we should leave that out of partition-wise join
patches. Do you think we solving this problem is a prerequisite for
partition-wise join? Or should we propose that patch as a separate
enhancement?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#176Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Ashutosh Bapat (#175)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Your patch didn't improve planning time without partition-wise join,
so it's something good to have along-with partition-wise join. Given
that Bitmapsets are used in other parts of code as well, the
optimization may affect those parts as well, esp. the overhead of
maintaining first_non_empty_wordnum.

Maybe, but if you consider that this container already deals with the
upper bound moving up by reallocating and copying the whole thing,
adjusting an int when the lower bound moves down doesn't seem like
anything to worry about...

The comment at the beginning of the file bitmapset.c says
3 * bitmapset.c
4 * PostgreSQL generic bitmap set package
5 *
6 * A bitmap set can represent any set of nonnegative integers, although
7 * it is mainly intended for sets where the maximum value is not large,
8 * say at most a few hundred.

When we created thousands of children, we have certainly crossed the
few hundred threashold. So, there may be other optimizations possible
there. Probably we should leave that out of partition-wise join
patches.

+1

Do you think we solving this problem is a prerequisite for
partition-wise join? Or should we propose that patch as a separate
enhancement?

No, I'm not proposing anything yet. For now I just wanted to share
this observation about where hot CPU time goes in simple tests, and
since it turned out to be a loop in a loop that I could see an easy to
way to fix for singleton sets and sets with a small range, I couldn't
help trying it... But I'm still trying to understand the bigger
picture. I'll be interested to compare profiles with the ordered
append_rel_list version you have mentioned, to see how that moves the
hot spots.

I guess one very practical question to ask is: can we plan queries
with realistic numbers of partitioned tables and partitions in
reasonable times? Well, it certainly looks very good for hundreds of
partitions so far... My own experience of partitioning with other
RDBMSs has been on that order, 'monthly partitions covering the past
10 years' and similar, but on the other hand it wouldn't be surprising
to learn that people want to go to many thousands, especially for
schemas which just keep adding partitions over time and don't want to
drop them. As for hash partitioning, that seems to be typically done
with numbers like 16, 32 or 64 in other products from what I can
glean. Speculation: perhaps hash partitioning is more motivated by
parallelism than data maintenance and thus somehow anchored to the
ground by core counts; if so no planning performance worries there I
guess (until core counts double quite a few more times).

One nice thing about the planning time is that restrictions on the
partition key cut down planning time; so where I measure ~7 seconds to
plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions,
if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so
if someone has a keep-adding-more-partitions-over-time model then at
least their prunable current day/week/whatever queries will not suffer
quite so badly. (Yeah my computer seems to be a lot slower than yours
for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @
2.2Ghz).

Curious: would you consider joins between partitioned tables and
non-partitioned tables where the join is pushed down to be a kind of
"partition-wise join", or something else? If so, would that be a
special case, or just the logical extreme case for
0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
one single "partition" on the non-partitioned side maps to all the
partitions on the partitioned size?

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#177Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#175)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Aug 9, 2017 at 7:09 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I started reviewing this. It's nicely commented, but it's also very
complicated, and it's going to take me several rounds to understand
what all the parts do, but here's some assorted feedback after reading
some parts of the patches, some tinkering and quite a bit of time
spent trying to break it (so far unsuccessfully).

Thanks for testing the patch. Good to know it has withstood your testing.

On my computer it took ~1.5 seconds to plan a 1000 partition join,
~7.1 seconds to plan a 2000 partition join, and ~50 seconds to plan a
4000 partition join. I poked around in a profiler a bit and saw that
for the 2000 partition case I spent almost half the time in
create_plan->...->prepare_sort_from_pathkeys->find_ec_member_for_tle,
and about half of that was in bms_is_subset. The other half the time
was in query_planner->make_one_rel which spent 2/3 of its time in
set_rel_size->add_child_rel_equivalences->bms_overlap and the other
1/3 in standard_join_search.

Thanks for profiling.

I have separately mailed about bitmapset improvements.

Equivalence classes contain all the expressions which are known to be
equal in EquivalenceClass::ec_members. For a partitioned table, there
will be as many expressions as the number of children. The child
expressions are marked as em_is_child and are looked at only when
child relids are available to the function scanning the members. The
number of equivalence members increases linearly with the number of
partitions, and the number of words in the bitmaps increases linearly
with the number of partitions, effectively the the number of words
scanned increases quadratically. Hence the superlinear increase in
time with the number of partitions. When I took separate profiles with
1000, 2000 and 4000 partitions resp. I see that 15%, 29% and 40% time
spent in bms_is_subset() resp.

I am not sure how much we can do in this patchset to reduce this
problem. Apart from your bitmapset optimization, we could perhaps use
some more efficient data structure other than list to search members
based on the relids OR re-use parent's expressions for child somehow.
I have been thinking about the second option, but never got a chance
to work on it.

When using list-based partitions, it must be possible to omit the part
of a join key that is implied by the partition because the partition
has only one list value. For example, if I create a two level
hierarchy with one partition per US state and then time-based range
partitions under that, the state part of this merge condition is
redundant:

Merge Cond: ((sales_wy_2017_10.state =
purchases_wy_2017_10.state) AND (sales_wy_2017_10.created =
purchases_wy_2017_10.created))

That's a good idea. In fact, we could use a similar trick when the
condition is sales_wy_2017_10.state = 'state'. We can not use the
trick in case of DML or when there are locking clauses, since we need
to evaluate the qual in case the row underneath changes while locking
it. We also can not do this when one of the keys being compared is a
nullable partition key (a concept explained in partition-wise join
implementation patch), since a partition can have also have rows with
NULL values for such partition keys in that partition.

I think the idea has merit, although, I think we should handle it
targetting more generic cases like the one stated above.

0003-Refactor-partition_bounds_equal-to-be-used-without-P.patch

-partition_bounds_equal(PartitionKey key,
+partition_bounds_equal(int partnatts, int16 *parttyplen, bool *parttypbyval,
PartitionBoundInfo b1,
PartitionBoundInfo b2)

I wonder is there any value in creating a struct to represent the
common part of PartitionKey and PartitionScheme that functions like
this and others need? Maybe not. Perhaps you didn't want to make
PartitionKey contain a PartitionScheme because then you'd lose the
property that every pointer to PartitionScheme in the system must be a
pointer to an interned (canonical) PartitionScheme, so it's always
safe to compare pointers to test for equality?

Right. Other reason to keep those two separate, is we might change the
contents of PartitionScheme as we move forward with the reviews. May
be we should revisit it after we have finalised the design.

0005-Canonical-partition-scheme.patch:

+/*
+ * get_relation_partition_info
+ *
+ * Retrieves partitioning information for a given relation.
+ *
+ * For a partitioned table it sets partitioning scheme, partition key
+ * expressions, number of partitions and OIDs of partitions in the given
+ * RelOptInfo.
+ */
+static void
+get_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,
+                                                       Relation relation)

Would this be better called "set_relation_partition_info"? It doesn't
really "retrieve" the information, it "installs" it.

Yes. Done.

+{
+       PartitionDesc part_desc;
+
+       /* No partitioning information for an unpartitioned relation. */
+       if (relation->rd_rel->relkind != RELKIND_PARTITIONED_TABLE ||
+               !(rel->part_scheme = find_partition_scheme(root, relation)))
+               return;

Here and elsewhere you use the idiom !(foo = bar), which is perfectly
good C in my book but I understand the project convention is to avoid
implicit pointer->boolean conversion and to prefer expressions like
(foo = bar) != NULL and there is certainly a lot more code like that.

PG code uses both the styles, search "if (!" in execExpr.c,
createplan.c for example.
I find this style useful, when I want to code, say "if this
relation does not have a partitioning scheme" rather than "if this
relation have NULL partitioning scheme".

0007-Partition-wise-join-implementation.patch

+ {"enable_partition_wise_join", PGC_USERSET, QUERY_TUNING_METHOD,

This GUC should appear in postgresql.conf.sample.

Done.

Attached patches with the comments addressed.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v25.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v25.tar.gz
#178Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Thomas Munro (#176)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 10, 2017 at 3:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Thu, Aug 10, 2017 at 6:23 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Your patch didn't improve planning time without partition-wise join,
so it's something good to have along-with partition-wise join. Given
that Bitmapsets are used in other parts of code as well, the
optimization may affect those parts as well, esp. the overhead of
maintaining first_non_empty_wordnum.

Maybe, but if you consider that this container already deals with the
upper bound moving up by reallocating and copying the whole thing,
adjusting an int when the lower bound moves down doesn't seem like
anything to worry about...

Yeah. May be we should check whether that makes any difference to
planning times of TPC-H queries. If it shows any difference.

Do you think we solving this problem is a prerequisite for
partition-wise join? Or should we propose that patch as a separate
enhancement?

No, I'm not proposing anything yet. For now I just wanted to share
this observation about where hot CPU time goes in simple tests, and
since it turned out to be a loop in a loop that I could see an easy to
way to fix for singleton sets and sets with a small range, I couldn't
help trying it... But I'm still trying to understand the bigger
picture. I'll be interested to compare profiles with the ordered
append_rel_list version you have mentioned, to see how that moves the
hot spots.

build_simple_rel() which contains that loop takes only .23% of
planning time. So, I doubt if that's going to change much.
+ 0.23% postgres postgres [.] build_simple_rel

I guess one very practical question to ask is: can we plan queries
with realistic numbers of partitioned tables and partitions in
reasonable times? Well, it certainly looks very good for hundreds of
partitions so far... My own experience of partitioning with other
RDBMSs has been on that order, 'monthly partitions covering the past
10 years' and similar, but on the other hand it wouldn't be surprising
to learn that people want to go to many thousands, especially for
schemas which just keep adding partitions over time and don't want to
drop them. As for hash partitioning, that seems to be typically done
with numbers like 16, 32 or 64 in other products from what I can
glean. Speculation: perhaps hash partitioning is more motivated by
parallelism than data maintenance and thus somehow anchored to the
ground by core counts; if so no planning performance worries there I
guess (until core counts double quite a few more times).

Agreed.

One nice thing about the planning time is that restrictions on the
partition key cut down planning time; so where I measure ~7 seconds to
plan SELECT * FROM foofoo JOIN barbar USING (a, b) with 2k partitions,
if I add WHERE a > 50 it's ~4 seconds and WHERE a > 99 it's ~0.8s, so
if someone has a keep-adding-more-partitions-over-time model then at
least their prunable current day/week/whatever queries will not suffer
quite so badly. (Yeah my computer seems to be a lot slower than yours
for these tests; clang -O2 no asserts on a mid 2014 MBP with i7 @
2.2Ghz).

That's interesting observation. Thanks for sharing it.

Curious: would you consider joins between partitioned tables and
non-partitioned tables where the join is pushed down to be a kind of
"partition-wise join", or something else? If so, would that be a
special case, or just the logical extreme case for
0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
one single "partition" on the non-partitioned side maps to all the
partitions on the partitioned size?

Parameterized nest loop joins with partition key as parameters
simulate something like that. Apart from that case, I don't see any
case where such a join would be more efficient compared to the current
method of ganging all partitions and joining them to the unpartitioned
table. But oh wait, that could be useful in sharding, when the
unpartitioned table is replicated and partitioned table is distributed
across shards. So, yes, that's a useful case. I am not sure whether
it's some kind of partition-wise join; it doesn't matter, it looks
useful. Said that, I am not planning to handle it in the near future.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#179Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#176)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 10, 2017 at 5:43 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Do you think we solving this problem is a prerequisite for
partition-wise join? Or should we propose that patch as a separate
enhancement?

No, I'm not proposing anything yet. For now I just wanted to share
this observation about where hot CPU time goes in simple tests, and
since it turned out to be a loop in a loop that I could see an easy to
way to fix for singleton sets and sets with a small range, I couldn't
help trying it... But I'm still trying to understand the bigger
picture. I'll be interested to compare profiles with the ordered
append_rel_list version you have mentioned, to see how that moves the
hot spots.

Perhaps this is stating the obvious, but it's often better to optimize
things like this at a higher level, rather than by tinkering with
stuff like Bitmapset. On the other hand, sometimes
micro-optimizations are the way to go, because optimizing
find_ec_member_for_tle(), for example, might involve a much broader
rethink of the planner code than we want to undertake right now.

I guess one very practical question to ask is: can we plan queries
with realistic numbers of partitioned tables and partitions in
reasonable times? Well, it certainly looks very good for hundreds of
partitions so far... My own experience of partitioning with other
RDBMSs has been on that order, 'monthly partitions covering the past
10 years' and similar, but on the other hand it wouldn't be surprising
to learn that people want to go to many thousands, especially for
schemas which just keep adding partitions over time and don't want to
drop them.

I've been thinking that it would be good if this feature - and other
new partitioning features - could scale to about 1000 partitions
without too much trouble. Eventually, it might be nice to scale
higher, but there's not much point in making partition-wise join scale
to 100,000 partitions if we've got some other part of the system that
runs into trouble beyond 250.

Curious: would you consider joins between partitioned tables and
non-partitioned tables where the join is pushed down to be a kind of
"partition-wise join", or something else? If so, would that be a
special case, or just the logical extreme case for
0014-WIP-Partition-wise-join-for-1-1-1-0-0-1-partition-ma.patch, where
one single "partition" on the non-partitioned side maps to all the
partitions on the partitioned size?

I think this is actually a really important case which we've just
excluded from the initial scope because the problem is hard enough
already. But it's quite possible that if you are joining partitioned
tables A and B with unpartitioned table X, the right join order could
be A-X-B; the A-X join might knock out a lot of rows. It's not great
to have to pick between doing the A-B join partitionwise and doing the
A-X join first; you want to get both things. But we can't do
everything at once.

Further down the road, there's more than one way of doing the A-X
join. You could join each partition of A to all of X, which is likely
optimal if for example you are doing a nested loop with an inner index
scan on X. But you could also partition X on the fly using A's
partitioning scheme and then join partitions of A against the
on-the-fly-partitioned version of X. That's likely to be a lot better
for a merge join with an underlying sort on X.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#180Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#177)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Attached patches with the comments addressed.

I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4
and e139f1953f29db245f60a7acb72fccb1e05d2442.

0004 doesn't apply any more, probably due to commit
d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along
these lines could be separately committed prior to the main patch, and
I think that would be a good idea just to flush out any bugs in this
part independently of the rest. However, I also think that we
probably ought to try to get Amit Langote's changes to this function
to repair the locking order and expand in bound order committed before
proceeding with these changes.

In fact, I think there's a certain amount of conflict between what's
being discussed over there and what you're trying to do here. In that
thread, we propose to move partitioned tables at any level to the
front of the inheritance expansion. Here, however, you want to expand
level by level. I think partitioned-tables-first is the right
approach for the reasons discussed on the other thread; namely, we
want to be able to prune leaf partitions before expanding them, but
that requires us to expand all the non-leaf tables first to maintain a
consistent locking order in all scenarios. So the approach you've
taken in this patch may need to be re-thought somewhat.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#181Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#180)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Aug 15, 2017 at 10:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 10, 2017 at 8:00 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Attached patches with the comments addressed.

I have committed 0001-0003 as 480f1f4329f1bf8bfbbcda8ed233851e1b110ad4
and e139f1953f29db245f60a7acb72fccb1e05d2442.

Thanks a lot Robert. Some less patches to maintain :).

0004 doesn't apply any more, probably due to commit
d57929afc7063431f80a0ac4c510fc39aacd22e6. I think something along
these lines could be separately committed prior to the main patch, and
I think that would be a good idea just to flush out any bugs in this
part independently of the rest. However, I also think that we
probably ought to try to get Amit Langote's changes to this function
to repair the locking order and expand in bound order committed before
proceeding with these changes.

I am reviewing those changes and contribute to that thread if necessary.

In fact, I think there's a certain amount of conflict between what's
being discussed over there and what you're trying to do here. In that
thread, we propose to move partitioned tables at any level to the
front of the inheritance expansion. Here, however, you want to expand
level by level. I think partitioned-tables-first is the right
approach for the reasons discussed on the other thread; namely, we
want to be able to prune leaf partitions before expanding them, but
that requires us to expand all the non-leaf tables first to maintain a
consistent locking order in all scenarios. So the approach you've
taken in this patch may need to be re-thought somewhat.

There are two ways we can do this
1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
or just childRTIndex (using this we can fetch childRTE calling
rtfetch()) of intermediate partitioned tables. Once we are done
expanding immediate children, call expand_inherited_rtentry()
recursively on this list.

2. expand_inherited_tables() scans root->parse->rtable only upto the
end of original range table list. Make it go beyond that end,
expanding any new entries added for intermediate partitions.

FWIW, the first option allows us to keep all AppendRelInfos
corresponding to one partitioned relation together and also expands
the whole partition hierarchy in one go. Second will require minimal
changes to expand_inherited_rtentry(). Both approaches will spend time
scanning same number of RTE; the first will have them in different
lists, and second will have them in root->parse->rtable. I don't see
one being more attractive than the other. Do you have any opinion?

I will submit the rebased patches after reviewing/adjusting Amit's
changes and also the changes in expand_inherited_rtentry() after we
have concluded the approach to be taken.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#182Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#181)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

There are two ways we can do this
1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
or just childRTIndex (using this we can fetch childRTE calling
rtfetch()) of intermediate partitioned tables. Once we are done
expanding immediate children, call expand_inherited_rtentry()
recursively on this list.

2. expand_inherited_tables() scans root->parse->rtable only upto the
end of original range table list. Make it go beyond that end,
expanding any new entries added for intermediate partitions.

FWIW, the first option allows us to keep all AppendRelInfos
corresponding to one partitioned relation together and also expands
the whole partition hierarchy in one go. Second will require minimal
changes to expand_inherited_rtentry(). Both approaches will spend time
scanning same number of RTE; the first will have them in different
lists, and second will have them in root->parse->rtable. I don't see
one being more attractive than the other. Do you have any opinion?

I don't like option (2). I'm not sure about option (1). I think
maybe we should have two nested loops in expanded_inherited_rtentry(),
the outer one iterating over partitioned tables (or just the original
parent RTE if partitioning is not involved) and then inner one looping
over individual leaf partitions for each partitioned table. Probably
we'd end up wanting to move at least some of the logic inside the
existing loop into a subroutine.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#183Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#182)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Aug 16, 2017 at 5:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 16, 2017 at 3:31 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

There are two ways we can do this
1. In expand_inherited_rtentry(), remember (childRTE and childRTIndex)
or just childRTIndex (using this we can fetch childRTE calling
rtfetch()) of intermediate partitioned tables. Once we are done
expanding immediate children, call expand_inherited_rtentry()
recursively on this list.

2. expand_inherited_tables() scans root->parse->rtable only upto the
end of original range table list. Make it go beyond that end,
expanding any new entries added for intermediate partitions.

FWIW, the first option allows us to keep all AppendRelInfos
corresponding to one partitioned relation together and also expands
the whole partition hierarchy in one go. Second will require minimal
changes to expand_inherited_rtentry(). Both approaches will spend time
scanning same number of RTE; the first will have them in different
lists, and second will have them in root->parse->rtable. I don't see
one being more attractive than the other. Do you have any opinion?

I don't like option (2). I'm not sure about option (1). I think
maybe we should have two nested loops in expanded_inherited_rtentry(),
the outer one iterating over partitioned tables (or just the original
parent RTE if partitioning is not involved) and then inner one looping
over individual leaf partitions for each partitioned table. Probably
we'd end up wanting to move at least some of the logic inside the
existing loop into a subroutine.

I originally thought to provide it along-with the changes to
expand_inherited_rtentry(), but that thread is taking longer. Jeevan
Chalke needs rebased patches for his work on aggregate pushdown and
Thomas might need them for further review. So, here they are. The last
two patches in this set implement the advanced partition matching
algorithm. Those patches are here for ready reference. One can observe
that patch doesn't change much of the basic partition-wise join
implementation. I am starting a new thread for discussing the advanced
partition matching algorithm.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v26.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v26.tar.gz
#184Antonin Houska
Antonin Houska
ah@cybertec.at
In reply to: Ashutosh Bapat (#183)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I originally thought to provide it along-with the changes to
expand_inherited_rtentry(), but that thread is taking longer. Jeevan
Chalke needs rebased patches for his work on aggregate pushdown and
Thomas might need them for further review. So, here they are.

Since I have related patch in the current commitfest
(https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your
patch:

* generate_partition_wise_join_paths()

Right parenthesis is missing in the prologue.

* get_partitioned_child_rels_for_join()

I think the Assert() statement is easier to understand inside the loop, see
the assert.diff attachment.

* have_partkey_equi_join()

As the function handles generic join, this comment doesn't seem to me
relevant:

/*
* The equi-join between partition keys is strict if equi-join between
* at least one partition key is using a strict operator. See
* explanation about outer join reordering identity 3 in
* optimizer/README
*/
strict_op = op_strict(opexpr->opno);

And I think the function can return true even if strict_op is false for all
the operators evaluated in the loop.

* match_expr_to_partition_keys()

I'm not sure this comment is clear enough:

/*
* If it's a strict equi-join a NULL partition key on one side will
* not join a NULL partition key on the other side. So, rows with NULL
* partition key from a partition on one side can not join with those
* from a non-matching partition on the other side. So, search the
* nullable partition keys as well.
*/
if (!strict_op)
continue;

My understanding of the problem of NULL values generated by outer join is:
these NULL values --- if evaluated by non-strict expression --- can make row
of N-th partition on one side of the join match row(s) of *other than* N-th
partition(s) on the other side. Thus the nullable input expressions may only
be evaluated by strict operators. I think it'd be clearer if you stressed that
(undesired) *match* of partition keys can be a problem, rather than mismatch.

If you insist on your wording, then I think you should at least move the
comment below to the part that only deals with strict operators.

* There are several places where lfirst_node() macro should be used. For
example

rel = lfirst_node(RelOptInfo, lc);

instead of

rel = (RelOptInfo *) lfirst(lc);

* map_and_merge_partitions()

Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
to suppress compiler warnings) I think that this part needs more thought:

{
Assert(mergemap1[index1] != mergemap2[index2] &&
mergemap1[index1] >= 0 && mergemap2[index2] >= 0);

/*
* Both the partitions map to different merged partitions. This
* means that multiple partitions from one relation matches to one
* partition from the other relation. Partition-wise join does not
* handle this case right now, since it requires ganging multiple
* partitions together (into one RelOptInfo).
*/
merged_index = -1;
}

I could hit this path with the following test:

CREATE TABLE a(i int) PARTITION BY LIST(i);
CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
CREATE TABLE b(j int) PARTITION BY LIST(j);
CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);

SET enable_partition_wise_join TO on;

SELECT *
FROM a
FULL JOIN
b ON i = j;

I don't think there's a reason not to join a_0 partition to b_0, is there?

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

Attachments:

assert.difftext/x-diff
map_and_merge_partitions.difftext/x-diff
#185Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#183)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

Here's set of patches rebased on the latest head.

This rebase mainly changes patch 0001, which translates partition
hierarchy into inheritance hierarchy creating AppendRelInfos and
RelOptInfos for partitioned partitions. Because of that, it's not
necessary to record the partitioned partitions in a
PartitionedChildRelInfos::child_rels. The only RTI that goes in there
is the RTI of child RTE which is same as the parent RTE except inh
flag. I tried removing that with a series of changes but it seems that
following code in ExecInitModifyTable() requires it.
1897 /* The root table RT index is at the head of the
partitioned_rels list */
1898 if (node->partitioned_rels)
1899 {
1900 Index root_rti;
1901 Oid root_oid;
1902
1903 root_rti = linitial_int(node->partitioned_rels);
1904 root_oid = getrelid(root_rti, estate->es_range_table);
1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
1906 }
1907 else
1908 rel = mtstate->resultRelInfo->ri_RelationDesc;

I don't know whether we could change this code not to use
PartitionedChildRelInfos::child_rels. Removing
PartitionedChildRelInfos machinary seems like a separate patch.

The last two patches implement the advanced partition matching
algorithm and are here in this set for ready reference. Please use [1]/messages/by-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
for discussing/reviewing those.

[1]: /messages/by-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches.v27.tar.gzapplication/x-gzip; name=pg_dp_join_patches.v27.tar.gz
#186Etsuro Fujita
Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#185)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/02 4:10, Ashutosh Bapat wrote:

This rebase mainly changes patch 0001, which translates partition
hierarchy into inheritance hierarchy creating AppendRelInfos and
RelOptInfos for partitioned partitions. Because of that, it's not
necessary to record the partitioned partitions in a
PartitionedChildRelInfos::child_rels. The only RTI that goes in there
is the RTI of child RTE which is same as the parent RTE except inh
flag. I tried removing that with a series of changes but it seems that
following code in ExecInitModifyTable() requires it.
1897 /* The root table RT index is at the head of the
partitioned_rels list */
1898 if (node->partitioned_rels)
1899 {
1900 Index root_rti;
1901 Oid root_oid;
1902
1903 root_rti = linitial_int(node->partitioned_rels);
1904 root_oid = getrelid(root_rti, estate->es_range_table);
1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
1906 }
1907 else
1908 rel = mtstate->resultRelInfo->ri_RelationDesc;

I don't know whether we could change this code not to use
PartitionedChildRelInfos::child_rels.

Though I haven't read the patch yet, I think the above code is useless.
And I proposed a patch to clean it up before [1]/messages/by-id/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8@lab.ntt.co.jp. I'll add that patch
to the next commitfest.

Best regards,
Etsuro Fujita

[1]: /messages/by-id/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8@lab.ntt.co.jp
/messages/by-id/93cf9816-2f7d-0f67-8ed2-4a4e497a6ab8@lab.ntt.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#187Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Etsuro Fujita (#186)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/04 12:38, Etsuro Fujita wrote:

On 2017/09/02 4:10, Ashutosh Bapat wrote:

This rebase mainly changes patch 0001, which translates partition
hierarchy into inheritance hierarchy creating AppendRelInfos and
RelOptInfos for partitioned partitions. Because of that, it's not
necessary to record the partitioned partitions in a
PartitionedChildRelInfos::child_rels. The only RTI that goes in there
is the RTI of child RTE which is same as the parent RTE except inh
flag. I tried removing that with a series of changes but it seems that
following code in ExecInitModifyTable() requires it.
1897     /* The root table RT index is at the head of the
partitioned_rels list */
1898     if (node->partitioned_rels)
1899     {
1900         Index       root_rti;
1901         Oid         root_oid;
1902
1903         root_rti = linitial_int(node->partitioned_rels);
1904         root_oid = getrelid(root_rti, estate->es_range_table);
1905         rel = heap_open(root_oid, NoLock);  /* locked by InitPlan */
1906     }
1907     else
1908         rel = mtstate->resultRelInfo->ri_RelationDesc;

I don't know whether we could change this code not to use
PartitionedChildRelInfos::child_rels.

For a root partitioned tables, ModifyTable.partitioned_rels comes from
PartitionedChildRelInfo.child_rels recorded for the table by
expand_inherited_rtnentry(). In fact, the latter is copied verbatim to
ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same.
The only point of keeping those RT indexes around in the ModifyTable node
is for the executor to be able to locate partitioned table RT entries and
lock them. Without them, the executor wouldn't know about those tables at
all, because there won't be subplans corresponding to partitioned tables
in the tree and hence their RT indexes won't appear in the
ModifyTable.resultRelations list. If your patch adds partitioned child
rel AppendRelInfos back for whatever reason, you should also make sure in
inheritance_planner() that their RT indexes don't end up the
resultRelations list. See this piece of code in inheritance_planner():

1351 /* Build list of sub-paths */
1352 subpaths = lappend(subpaths, subpath);
1353
1354 /* Build list of modified subroots, too */
1355 subroots = lappend(subroots, subroot);
1356
1357 /* Build list of target-relation RT indexes */
1358 resultRelations = lappend_int(resultRelations,
appinfo->child_relid);

Maybe it won't happen, because if this appinfo corresponds to a
partitioned child table, recursion would occur and we'll get to this piece
of code for only the leaf children.

By the way, if you want to get rid of PartitionedChildRelInfo, you can do
that as long as you find some other way of putting together the
partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
node created for the root partitioned table. Currently,
PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
expand_inherited_rtentry() to pass that information to the planner's
path-generating code. We may be able to generate that list when actually
creating the path using set_append_rel_pathlist() or
inheritance_planner(), without having created a PartitionedChildRelInfo
node beforehand.

Though I haven't read the patch yet, I think the above code is useless.
And I proposed a patch to clean it up before [1].  I'll add that patch to
the next commitfest.

+1.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#188Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Antonin Houska (#184)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah@cybertec.at> wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I originally thought to provide it along-with the changes to
expand_inherited_rtentry(), but that thread is taking longer. Jeevan
Chalke needs rebased patches for his work on aggregate pushdown and
Thomas might need them for further review. So, here they are.

Since I have related patch in the current commitfest
(https://commitfest.postgresql.org/14/1247/), I spent some time reviewing your
patch:

* generate_partition_wise_join_paths()

Right parenthesis is missing in the prologue.

Thanks for pointing that out. Fixed.

* get_partitioned_child_rels_for_join()

I think the Assert() statement is easier to understand inside the loop, see
the assert.diff attachment.

The assert at the end of function also checks that we have got
child_rels lists for all the parents passed in. That is not checked by
your version. Furthermore, we would checked that each child_rels has
at least one element while buildings paths for base relations.
Checking the same again for joins doesn't add any value.

* have_partkey_equi_join()

As the function handles generic join, this comment doesn't seem to me
relevant:

/*
* The equi-join between partition keys is strict if equi-join between
* at least one partition key is using a strict operator. See
* explanation about outer join reordering identity 3 in
* optimizer/README
*/
strict_op = op_strict(opexpr->opno);

What in that comment is not exactly relevant?

And I think the function can return true even if strict_op is false for all
the operators evaluated in the loop.

I think it does that. Do you have a case where it doesn't?

* match_expr_to_partition_keys()

I'm not sure this comment is clear enough:

/*
* If it's a strict equi-join a NULL partition key on one side will
* not join a NULL partition key on the other side. So, rows with NULL
* partition key from a partition on one side can not join with those
* from a non-matching partition on the other side. So, search the
* nullable partition keys as well.
*/
if (!strict_op)
continue;

My understanding of the problem of NULL values generated by outer join is:
these NULL values --- if evaluated by non-strict expression --- can make row
of N-th partition on one side of the join match row(s) of *other than* N-th
partition(s) on the other side. Thus the nullable input expressions may only
be evaluated by strict operators. I think it'd be clearer if you stressed that
(undesired) *match* of partition keys can be a problem, rather than mismatch

Sorry, I am not able to understand this. To me it looks like my
wording conveys what you are saying.

If you insist on your wording, then I think you should at least move the
comment below to the part that only deals with strict operators.

Done.

* There are several places where lfirst_node() macro should be used. For
example

rel = lfirst_node(RelOptInfo, lc);

instead of

rel = (RelOptInfo *) lfirst(lc);

Thanks for that.

* map_and_merge_partitions()

Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
to suppress compiler warnings) I think that this part needs more thought:

{
Assert(mergemap1[index1] != mergemap2[index2] &&
mergemap1[index1] >= 0 && mergemap2[index2] >= 0);

/*
* Both the partitions map to different merged partitions. This
* means that multiple partitions from one relation matches to one
* partition from the other relation. Partition-wise join does not
* handle this case right now, since it requires ganging multiple
* partitions together (into one RelOptInfo).
*/
merged_index = -1;
}

I could hit this path with the following test:

CREATE TABLE a(i int) PARTITION BY LIST(i);
CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
CREATE TABLE b(j int) PARTITION BY LIST(j);
CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);

SET enable_partition_wise_join TO on;

SELECT *
FROM a
FULL JOIN
b ON i = j;

I don't think there's a reason not to join a_0 partition to b_0, is there?

With the latest patchset I am seeing that partition-wise join is used
in this case. I have started a new thread [1]/messages/by-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com for advanced partition
matching patches. Please post review comments about the last two
patches on that thread.

[1]: /messages/by-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com

Attached patchset with above comments addressed.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v28.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v28.tar.gz
#189Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#187)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/04 12:38, Etsuro Fujita wrote:

On 2017/09/02 4:10, Ashutosh Bapat wrote:

This rebase mainly changes patch 0001, which translates partition
hierarchy into inheritance hierarchy creating AppendRelInfos and
RelOptInfos for partitioned partitions. Because of that, it's not
necessary to record the partitioned partitions in a
PartitionedChildRelInfos::child_rels. The only RTI that goes in there
is the RTI of child RTE which is same as the parent RTE except inh
flag. I tried removing that with a series of changes but it seems that
following code in ExecInitModifyTable() requires it.
1897 /* The root table RT index is at the head of the
partitioned_rels list */
1898 if (node->partitioned_rels)
1899 {
1900 Index root_rti;
1901 Oid root_oid;
1902
1903 root_rti = linitial_int(node->partitioned_rels);
1904 root_oid = getrelid(root_rti, estate->es_range_table);
1905 rel = heap_open(root_oid, NoLock); /* locked by InitPlan */
1906 }
1907 else
1908 rel = mtstate->resultRelInfo->ri_RelationDesc;

I don't know whether we could change this code not to use
PartitionedChildRelInfos::child_rels.

For a root partitioned tables, ModifyTable.partitioned_rels comes from
PartitionedChildRelInfo.child_rels recorded for the table by
expand_inherited_rtnentry(). In fact, the latter is copied verbatim to
ModifyTablePath (or AppendPath/MergeAppendPath) when creating the same.
The only point of keeping those RT indexes around in the ModifyTable node
is for the executor to be able to locate partitioned table RT entries and
lock them. Without them, the executor wouldn't know about those tables at
all, because there won't be subplans corresponding to partitioned tables
in the tree and hence their RT indexes won't appear in the
ModifyTable.resultRelations list. If your patch adds partitioned child
rel AppendRelInfos back for whatever reason, you should also make sure in
inheritance_planner() that their RT indexes don't end up the
resultRelations list. See this piece of code in inheritance_planner():

1351 /* Build list of sub-paths */
1352 subpaths = lappend(subpaths, subpath);
1353
1354 /* Build list of modified subroots, too */
1355 subroots = lappend(subroots, subroot);
1356
1357 /* Build list of target-relation RT indexes */
1358 resultRelations = lappend_int(resultRelations,
appinfo->child_relid);

Maybe it won't happen, because if this appinfo corresponds to a
partitioned child table, recursion would occur and we'll get to this piece
of code for only the leaf children.

You are right. We don't execute above lines for partitioned partitions.

By the way, if you want to get rid of PartitionedChildRelInfo, you can do
that as long as you find some other way of putting together the
partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
node created for the root partitioned table. Currently,
PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
expand_inherited_rtentry() to pass that information to the planner's
path-generating code. We may be able to generate that list when actually
creating the path using set_append_rel_pathlist() or
inheritance_planner(), without having created a PartitionedChildRelInfo
node beforehand.

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)

Though I haven't read the patch yet, I think the above code is useless.
And I proposed a patch to clean it up before [1]. I'll add that patch to
the next commitfest.

+1.

+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#190Etsuro Fujita
Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#189)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/04 21:32, Ashutosh Bapat wrote:

On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote

By the way, if you want to get rid of PartitionedChildRelInfo, you can do
that as long as you find some other way of putting together the
partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
node created for the root partitioned table. Currently,
PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
expand_inherited_rtentry() to pass that information to the planner's
path-generating code. We may be able to generate that list when actually
creating the path using set_append_rel_pathlist() or
inheritance_planner(), without having created a PartitionedChildRelInfo
node beforehand.

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions, I don't think
we need the list to take care of the locks.

I don't think so either. (Since I haven't followed discussions on this
thread in detail yet, I don't understand the idea/need of creating
AppendRelInfos for partitioned partitions, though.)

Though I haven't read the patch yet, I think the above code is useless.
And I proposed a patch to clean it up before [1]. I'll add that patch to
the next commitfest.

+1.

+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

No. The patch just removes the partitioned_rels list from
nodeModifyTable.c.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#191Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#189)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/04 21:32, Ashutosh Bapat wrote:

On Mon, Sep 4, 2017 at 10:04 AM, Amit Langote wrote:

By the way, if you want to get rid of PartitionedChildRelInfo, you can do
that as long as you find some other way of putting together the
partitioned_rels list to add into the ModifyTable (Append/MergeAppend)
node created for the root partitioned table. Currently,
PartitionedChildRelInfo (and the root->pcinfo_list) is the way for
expand_inherited_rtentry() to pass that information to the planner's
path-generating code. We may be able to generate that list when actually
creating the path using set_append_rel_pathlist() or
inheritance_planner(), without having created a PartitionedChildRelInfo
node beforehand.

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)

We do *need* the list in ModifyTable (Append/MergeAppend) node itself. We
can, however, get rid of the PartitionedChildRelInfo node that carries the
partitioned child RT indexes from an earlier planning phase
(expand_inherited_rtentry) to a later phase
(create_{modifytable|append|merge_append}_path). The later phase can
build that list from the AppendRelInfos that you mention we now [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154 build.

Though I haven't read the patch yet, I think the above code is useless.
And I proposed a patch to clean it up before [1]. I'll add that patch to
the next commitfest.

+1.

+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

As Fujita-san mentioned, his patch won't. Actually, I suppose he didn't
say that partitioned_rels itself is useless, just that its particular
usage in ExecInitModifyTable is. We still need that list for planner to
tell the executor that there are some RT entries the latter would need to
lock before executing a given plan. Without that dedicated list, the
executor cannot know at all that certain tables in the partition tree
(viz. the partitioned ones) need to be locked. I mentioned the reason -
(Merge)Append.subplans, ModifyTable.resultRelations does not contain
respective entries corresponding to the partitioned tables, and
traditionally, the executor looks at those lists to figure out the tables
to lock.

Thanks,
Amit

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#192Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#191)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/05 13:20, Amit Langote wrote:

The later phase can
build that list from the AppendRelInfos that you mention we now [1] build.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154

Looking at that commit again, AppendRelInfos are still not created for
partitioned child tables. Looking at the code in
expand_single_inheritance_child(), which exists as of 30833ba154:

/*
* Build an AppendRelInfo for this parent and child, unless the child is a
* partitioned table.
*/
if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh)
{
...code that builds AppendRelInfo...
}
else
*partitioned_child_rels = lappend_int(*partitioned_child_rels,
childRTindex);

you can see that an AppendRelInfo won't get built for partitioned child
tables.

Also, even if the commit changed things so that the child RT entries (and
AppendRelInfos) now get built in an order determined by depth-first
traversal of the partition tree, the same original parent RT index is used
to mark all AppendRelInfos, so the expansion essentially flattens the
hierarchy. In the updated patch I will post on the "path toward faster
partition pruning" thread [1]/messages/by-id/044e2e09-9690-7aff-1749-2d318da38a11@lab.ntt.co.jp, I am planning to rejigger things so that
two things start to happen:

1. For partitioned child tables, build the child RT entry with inh = true
and also build an AppendRelInfos

2. When recursively expanding a partitioned child table in
expand_partitioned_rtentry(), pass its new RT index as the
parentRTindex to the recursive call of expand_partitioned_rtentry(), so
that the resulting AppendRelInfos reflect immediate parent-child
relationship

With 1 in place, build_simple_rel() will build RelOptInfos even for
partitioned child tables, so that for each one, we can recursively build
an Append path. So, instead of just one Append path for the root
partitioned table, there is one for each partitioned table in the tree.

I will be including the above described change in the partition-pruning
patch, because the other code in that patch relies on the same and I know
Ashuotsh has wanted that for a long time. :)

Thanks,
Amit

[1]: /messages/by-id/044e2e09-9690-7aff-1749-2d318da38a11@lab.ntt.co.jp
/messages/by-id/044e2e09-9690-7aff-1749-2d318da38a11@lab.ntt.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#193Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#192)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 5, 2017 at 11:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/05 13:20, Amit Langote wrote:

The later phase can
build that list from the AppendRelInfos that you mention we now [1] build.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=30833ba154

Looking at that commit again, AppendRelInfos are still not created for
partitioned child tables. Looking at the code in
expand_single_inheritance_child(), which exists as of 30833ba154:

/*
* Build an AppendRelInfo for this parent and child, unless the child is a
* partitioned table.
*/
if (childrte->relkind != RELKIND_PARTITIONED_TABLE && !childrte->inh)
{
...code that builds AppendRelInfo...
}
else
*partitioned_child_rels = lappend_int(*partitioned_child_rels,
childRTindex);

you can see that an AppendRelInfo won't get built for partitioned child
tables.

Also, even if the commit changed things so that the child RT entries (and
AppendRelInfos) now get built in an order determined by depth-first
traversal of the partition tree, the same original parent RT index is used
to mark all AppendRelInfos, so the expansion essentially flattens the
hierarchy. In the updated patch I will post on the "path toward faster
partition pruning" thread [1], I am planning to rejigger things so that
two things start to happen:

1. For partitioned child tables, build the child RT entry with inh = true
and also build an AppendRelInfos

2. When recursively expanding a partitioned child table in
expand_partitioned_rtentry(), pass its new RT index as the
parentRTindex to the recursive call of expand_partitioned_rtentry(), so
that the resulting AppendRelInfos reflect immediate parent-child
relationship

With 1 in place, build_simple_rel() will build RelOptInfos even for
partitioned child tables, so that for each one, we can recursively build
an Append path. So, instead of just one Append path for the root
partitioned table, there is one for each partitioned table in the tree.

I will be including the above described change in the partition-pruning
patch, because the other code in that patch relies on the same and I know
Ashuotsh has wanted that for a long time. :)

Those changes are already part of my updated 0001 patch. Aren't they?
May be you should just review those and see if those are suitable for
you?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#194Etsuro Fujita
Etsuro Fujita
fujita.etsuro@lab.ntt.co.jp
In reply to: Amit Langote (#191)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/05 13:20, Amit Langote wrote:

On 2017/09/04 21:32, Ashutosh Bapat wrote:

+1. Will Fujita-san's patch also handle getting rid of partitioned_rels list?

As Fujita-san mentioned, his patch won't. Actually, I suppose he didn't
say that partitioned_rels itself is useless, just that its particular
usage in ExecInitModifyTable is.

That's right. (I thought there would probably be no need to create that
list if we created AppendRelInfos even for partitioned partitions.)

We still need that list for planner to
tell the executor that there are some RT entries the latter would need to
lock before executing a given plan. Without that dedicated list, the
executor cannot know at all that certain tables in the partition tree
(viz. the partitioned ones) need to be locked. I mentioned the reason -
(Merge)Append.subplans, ModifyTable.resultRelations does not contain
respective entries corresponding to the partitioned tables, and
traditionally, the executor looks at those lists to figure out the tables
to lock.

I think so too.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#195Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#193)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/05 15:30, Ashutosh Bapat wrote:

Those changes are already part of my updated 0001 patch. Aren't they?
May be you should just review those and see if those are suitable for
you?

Yeah, I think it's going to be the same patch, functionality-wise.

And sorry, I didn't realize you were talking about the case after applying
your patch on HEAD.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#196Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#195)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 5, 2017 at 12:06 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/05 15:30, Ashutosh Bapat wrote:

Those changes are already part of my updated 0001 patch. Aren't they?
May be you should just review those and see if those are suitable for
you?

Yeah, I think it's going to be the same patch, functionality-wise.

And sorry, I didn't realize you were talking about the case after applying
your patch on HEAD.

Ok. Can you please answer my previous questions?

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions with my 0001
patch, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)?

Having asked that, I think my patch shouldn't deal with removing
partitioned_rels lists and related structures and code. It should be
done as a separate patch.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#197Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#196)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/05 15:43, Ashutosh Bapat wrote:

Ok. Can you please answer my previous questions?

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions with my 0001
patch, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)?

AppendRelInfos exist within the planner (they come to be and go away
within the planner). Once we leave the planner, that information is gone.

Executor will receive a plan node that won't contain that information:

1. Append has an appendplans field, which contains one plan tree for every
leaf partition. None of its fields, other than partitined_rels,
contains the RT indexes of the partitioned tables. Similarly in the
case of MergeAppend.

2. ModifyTable has a resultRelations fields which contains a list of leaf
partition RT indexes and a plans field which contains one plan tree for
every RT index in the resultRelations list (that is a plan tree that
will scan the particular leaf partition). None of its fields, other
than partitined_rels, contains the RT indexes of the partitioned
tables.

I learned over the course of developing the patch that added this
partitioned_rels field [1]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3cc37f1d8 that the executor needs to identify all the
affected tables by a given plan tree so that it could lock them. Executor
needs to lock them separately even if the plan itself was built after
locking all the relevant tables [2]/messages/by-id/CA+TgmoYiwviCDRi3Zk+QuXj1r7uMu9T_kDNq+17PCWgzrbzw8A@mail.gmail.com. For example, see
ExecLockNonLeafAppendTables(), which will lock the tables in the
(Merge)Append.partitioned_rels list.

While I've been thinking all along that the same thing must be happening
for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
times on this thread), it's actually not. Instead,
ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
merged into PlannedStmt.nonleafResultRelations (which happens in
set_plan_refs) and that's where the executor finds them to lock them
(which happens in InitPlan).

So, it appears that ModifyTable.partitioned_rels is indeed unused in the
executor. But we still can't get rid of it from the ModifyTable node
itself without figuring out a way (a channel) to transfer that information
into PlannedStmt.nonleafResultRelations.

Having asked that, I think my patch shouldn't deal with removing
partitioned_rels lists and related structures and code. It should be> done as a separate patch.

Going back to your original email which started this discussion, it seems
that we agree on that the PartitionedChildRelInfo node can be removed, and
I agree that it shouldn't be done in the partitionwise-join patch series
but as a separate patch. As described above, we shouldn't try yet to get
rid of the partitioned_rels list that appears in some plan nodes.

Thanks,
Amit

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3cc37f1d8
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d3cc37f1d8

[2]: /messages/by-id/CA+TgmoYiwviCDRi3Zk+QuXj1r7uMu9T_kDNq+17PCWgzrbzw8A@mail.gmail.com
/messages/by-id/CA+TgmoYiwviCDRi3Zk+QuXj1r7uMu9T_kDNq+17PCWgzrbzw8A@mail.gmail.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#198Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#197)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/05 15:43, Ashutosh Bapat wrote:

Ok. Can you please answer my previous questions?

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions with my 0001
patch, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)?

AppendRelInfos exist within the planner (they come to be and go away
within the planner). Once we leave the planner, that information is gone.

Executor will receive a plan node that won't contain that information:

1. Append has an appendplans field, which contains one plan tree for every
leaf partition. None of its fields, other than partitined_rels,
contains the RT indexes of the partitioned tables. Similarly in the
case of MergeAppend.

2. ModifyTable has a resultRelations fields which contains a list of leaf
partition RT indexes and a plans field which contains one plan tree for
every RT index in the resultRelations list (that is a plan tree that
will scan the particular leaf partition). None of its fields, other
than partitined_rels, contains the RT indexes of the partitioned
tables.

I learned over the course of developing the patch that added this
partitioned_rels field [1] that the executor needs to identify all the
affected tables by a given plan tree so that it could lock them. Executor
needs to lock them separately even if the plan itself was built after
locking all the relevant tables [2]. For example, see
ExecLockNonLeafAppendTables(), which will lock the tables in the
(Merge)Append.partitioned_rels list.

While I've been thinking all along that the same thing must be happening
for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
times on this thread), it's actually not. Instead,
ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
merged into PlannedStmt.nonleafResultRelations (which happens in
set_plan_refs) and that's where the executor finds them to lock them
(which happens in InitPlan).

So, it appears that ModifyTable.partitioned_rels is indeed unused in the
executor. But we still can't get rid of it from the ModifyTable node
itself without figuring out a way (a channel) to transfer that information
into PlannedStmt.nonleafResultRelations.

Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am
not adding any partitioned partition other than the parent itself. But
since every partitioned partition in turn acts as parent, it appears
its own list. The list obtained by concatenating all such lists
together contains all the partitioned partition RTIs. In my patch, I
need to teach accumulate_append_subpath() to accumulate
partitioned_rels as well.

Having asked that, I think my patch shouldn't deal with removing
partitioned_rels lists and related structures and code. It should be> done as a separate patch.

Going back to your original email which started this discussion, it seems
that we agree on that the PartitionedChildRelInfo node can be removed, and
I agree that it shouldn't be done in the partitionwise-join patch series
but as a separate patch. As described above, we shouldn't try yet to get
rid of the partitioned_rels list that appears in some plan nodes.

Thanks.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#199Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#198)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 5, 2017 at 3:00 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Sep 5, 2017 at 1:16 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/05 15:43, Ashutosh Bapat wrote:

Ok. Can you please answer my previous questions?

AFAIU, the list contained RTIs of the relations, which didnt' have
corresponding AppendRelInfos to lock those relations. Now that we
create AppendRelInfos even for partitioned partitions with my 0001
patch, I don't think
we need the list to take care of the locks. Is there any other reason
why we maintain that list (apart from the trigger case I have raised
and Fujita-san says that the list is not required in that case as
well.)?

AppendRelInfos exist within the planner (they come to be and go away
within the planner). Once we leave the planner, that information is gone.

Executor will receive a plan node that won't contain that information:

1. Append has an appendplans field, which contains one plan tree for every
leaf partition. None of its fields, other than partitined_rels,
contains the RT indexes of the partitioned tables. Similarly in the
case of MergeAppend.

2. ModifyTable has a resultRelations fields which contains a list of leaf
partition RT indexes and a plans field which contains one plan tree for
every RT index in the resultRelations list (that is a plan tree that
will scan the particular leaf partition). None of its fields, other
than partitined_rels, contains the RT indexes of the partitioned
tables.

I learned over the course of developing the patch that added this
partitioned_rels field [1] that the executor needs to identify all the
affected tables by a given plan tree so that it could lock them. Executor
needs to lock them separately even if the plan itself was built after
locking all the relevant tables [2]. For example, see
ExecLockNonLeafAppendTables(), which will lock the tables in the
(Merge)Append.partitioned_rels list.

While I've been thinking all along that the same thing must be happening
for RT indexes in ModifyTable.partitioned_rels list (I said so a couple of
times on this thread), it's actually not. Instead,
ModifyTable.partitioned_rels of all ModifyTable nodes in a give query are
merged into PlannedStmt.nonleafResultRelations (which happens in
set_plan_refs) and that's where the executor finds them to lock them
(which happens in InitPlan).

So, it appears that ModifyTable.partitioned_rels is indeed unused in the
executor. But we still can't get rid of it from the ModifyTable node
itself without figuring out a way (a channel) to transfer that information
into PlannedStmt.nonleafResultRelations.

Thanks a lot for this detailed analysis. IIUC, in my 0001 patch, I am
not adding any partitioned partition other than the parent itself. But
since every partitioned partition in turn acts as parent, it appears
its own list. The list obtained by concatenating all such lists
together contains all the partitioned partition RTIs. In my patch, I
need to teach accumulate_append_subpath() to accumulate
partitioned_rels as well.

accumulate_append_subpath() is executed for every path instead of
every relation, so changing it would collect the same list multiple
times. Instead, I found the old way of associating all intermediate
partitions with the root partitioned relation work better. Here's the
updated patch set.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v29.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v29.tar.gz
#200Antonin Houska
Antonin Houska
ah@cybertec.at
In reply to: Ashutosh Bapat (#188)
Re: Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah@cybertec.at> wrote:

* get_partitioned_child_rels_for_join()

I think the Assert() statement is easier to understand inside the loop, see
the assert.diff attachment.

The assert at the end of function also checks that we have got
child_rels lists for all the parents passed in.

Really? I can imagine that some instances of PartitionedChildRelInfo have the
child_rels list empty, while other ones have these lists long enough to
compensate for the empty lists.

* have_partkey_equi_join()

As the function handles generic join, this comment doesn't seem to me
relevant:

/*
* The equi-join between partition keys is strict if equi-join between
* at least one partition key is using a strict operator. See
* explanation about outer join reordering identity 3 in
* optimizer/README
*/
strict_op = op_strict(opexpr->opno);

What in that comment is not exactly relevant?

Basically I don't understand why you mention join reordering here. The join
ordering questions must all have been resolved by the time
have_partkey_equi_join() is called.

And I think the function can return true even if strict_op is false for all
the operators evaluated in the loop.

I think it does that. Do you have a case where it doesn't?

Here I refer to this part of the comment above:

"... if equi-join between at least one partition key is using a strict
operator."

My understanding of the code (especially match_expr_to_partition_keys) is that
no operator actually needs to be strict as long as each operator involved in
the join matches at least one non-nullable expression on both sides of the
join.

* match_expr_to_partition_keys()

I'm not sure this comment is clear enough:

/*
* If it's a strict equi-join a NULL partition key on one side will
* not join a NULL partition key on the other side. So, rows with NULL
* partition key from a partition on one side can not join with those
* from a non-matching partition on the other side. So, search the
* nullable partition keys as well.
*/
if (!strict_op)
continue;

My understanding of the problem of NULL values generated by outer join is:
these NULL values --- if evaluated by non-strict expression --- can make row
of N-th partition on one side of the join match row(s) of *other than* N-th
partition(s) on the other side. Thus the nullable input expressions may only
be evaluated by strict operators. I think it'd be clearer if you stressed that
(undesired) *match* of partition keys can be a problem, rather than mismatch

Sorry, I am not able to understand this. To me it looks like my
wording conveys what you are saying.

I just tried to expreess the idea in a way that is clearer to me. I think we
both mean the same. Not sure I should spend more effort on another version of
the comment.

If you insist on your wording, then I think you should at least move the
comment below to the part that only deals with strict operators.

Done.

o.k.

* map_and_merge_partitions()

Besides a few changes proposed in map_and_merge_partitions.diff (a few of them
to suppress compiler warnings) I think that this part needs more thought:

{
Assert(mergemap1[index1] != mergemap2[index2] &&
mergemap1[index1] >= 0 && mergemap2[index2] >= 0);

/*
* Both the partitions map to different merged partitions. This
* means that multiple partitions from one relation matches to one
* partition from the other relation. Partition-wise join does not
* handle this case right now, since it requires ganging multiple
* partitions together (into one RelOptInfo).
*/
merged_index = -1;
}

I could hit this path with the following test:

CREATE TABLE a(i int) PARTITION BY LIST(i);
CREATE TABLE a_0 PARTITION OF a FOR VALUES IN (0, 2);
CREATE TABLE b(j int) PARTITION BY LIST(j);
CREATE TABLE b_0 PARTITION OF b FOR VALUES IN (1, 2);

SET enable_partition_wise_join TO on;

SELECT *
FROM a
FULL JOIN
b ON i = j;

I don't think there's a reason not to join a_0 partition to b_0, is there?

With the latest patchset I am seeing that partition-wise join is used
in this case. I have started a new thread [1] for advanced partition
matching patches.

What plan do you get, with the patches from

/messages/by-id/CAFjFpRfdXpuSu0pxON3dKcr8WndJkaXLzHUVax_Laod0Tgc6UQ@mail.gmail.com

I still see the join above Append, not below:

QUERY PLAN
-------------------------------------------------------------------------
Merge Full Join (cost=359.57..860.00 rows=32512 width=8)
Merge Cond: (a_0.i = b_0.j)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: a_0.i
-> Append (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on a_0 (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: b_0.j
-> Append (cost=0.00..35.50 rows=2550 width=4)
-> Seq Scan on b_0 (cost=0.00..35.50 rows=2550 width=4)

Please post review comments about the last two patches on that thread.

ok, I'll do if I find any problem.

[1] /messages/by-id/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#201Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Antonin Houska (#200)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 7, 2017 at 4:32 PM, Antonin Houska <ah@cybertec.at> wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Sep 1, 2017 at 6:05 PM, Antonin Houska <ah@cybertec.at> wrote:

* get_partitioned_child_rels_for_join()

I think the Assert() statement is easier to understand inside the loop, see
the assert.diff attachment.

The assert at the end of function also checks that we have got
child_rels lists for all the parents passed in.

Really? I can imagine that some instances of PartitionedChildRelInfo have the
child_rels list empty, while other ones have these lists long enough to
compensate for the empty lists.

That isn't true. Each child_rels list will at least have one entry.
Please see get_partitioned_child_rels().

* have_partkey_equi_join()

As the function handles generic join, this comment doesn't seem to me
relevant:

/*
* The equi-join between partition keys is strict if equi-join between
* at least one partition key is using a strict operator. See
* explanation about outer join reordering identity 3 in
* optimizer/README
*/
strict_op = op_strict(opexpr->opno);

What in that comment is not exactly relevant?

Basically I don't understand why you mention join reordering here. The join
ordering questions must all have been resolved by the time
have_partkey_equi_join() is called.

I am referring to a particular section in README which talks about the
relation between strict operator and legal join order.

And I think the function can return true even if strict_op is false for all
the operators evaluated in the loop.

I think it does that. Do you have a case where it doesn't?

Here I refer to this part of the comment above:

"... if equi-join between at least one partition key is using a strict
operator."

My understanding of the code (especially match_expr_to_partition_keys) is that
no operator actually needs to be strict as long as each operator involved in
the join matches at least one non-nullable expression on both sides of the
join.

I don't think so. A strict operator returns NULL when either of the
inputs is NULL. We can not say so for non-strict operators, which may
deem NULL and non-NULL arguments as equal, even though that looks
insane.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#202Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#199)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

accumulate_append_subpath() is executed for every path instead of
every relation, so changing it would collect the same list multiple
times. Instead, I found the old way of associating all intermediate
partitions with the root partitioned relation work better. Here's the
updated patch set.

When I tried out patch 0001, it crashed repeatedly during 'make check'
because of an assertion failure in get_partitioned_child_rels. It
seemed to me that the way the patch was refactoring
expand_inherited_rtentry involved more code rearrangement than
necessary, so I reverted all the code rearrangement and just kept the
functional changes, and all the crashes went away. (That refactoring
also failed to initialize has_child properly.) In so doing, I
reintroduced the problem that the PartitionedChildRelInfo lists
weren't getting set up correctly, but after some thought I realized
that was just because expand_single_inheritance_child() was choosing
between adding an RTE and adding the OID to partitioned_child_rels,
whereas for an intermediate partitioned table it needs to do both. So
I inserted a trivial fix for that problem (replacing "else" with a new
"if"-test), basically:

-    else
+
+    if (childrte->relkind == RELKIND_PARTITIONED_TABLE)

Please check out the attached version of the patch. In addition to
the above simplifications, I did some adjustments to the comments in
various places - some just grammar and others a bit more substantive.
And I think I broke a long line in one place, too.

One thing I notice is that if I rip out the changes to initsplan.c,
the new regression test still passes. If it's possible to write a
test that fails without those changes, I think it would be a good idea
to include one in the patch. That's certainly one of the subtler
parts of this patch, IMHO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

expand-stepwise-rmh.patchapplication/octet-stream; name=expand-stepwise-rmh.patch
#203Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#202)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/08 4:04, Robert Haas wrote:

On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

accumulate_append_subpath() is executed for every path instead of
every relation, so changing it would collect the same list multiple
times. Instead, I found the old way of associating all intermediate
partitions with the root partitioned relation work better. Here's the
updated patch set.

When I tried out patch 0001, it crashed repeatedly during 'make check'
because of an assertion failure in get_partitioned_child_rels. It
seemed to me that the way the patch was refactoring
expand_inherited_rtentry involved more code rearrangement than
necessary, so I reverted all the code rearrangement and just kept the
functional changes, and all the crashes went away. (That refactoring
also failed to initialize has_child properly.)

When I tried the attached patch, it doesn't seem to expand partitioning
inheritance in step-wise manner as the patch's title says. I think the
rewritten patch forgot to include Ashutosh's changes to
expand_single_inheritance_child() whereby the AppendRelInfo of the child
will be marked with the direct parent instead of always the root parent.

I updated the patch to include just those changes. I'm not sure about
one of the Ashutosh's changes whereby the child PlanRowMark is also passed
to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
the child RTE, child RT index and child Relation are fine, because they
are necessary for creating AppendRelInfos in a desired way for later
planning steps. But PlanRowMarks are not processed within the planner
afterwards and do not need to be marked with the immediate parent-child
association in the same way that AppendRelInfos need to be.

I also included the changes to add_paths_to_append_rel() from my patch on
the "path toward faster partition pruning" thread. We'd need that change,
because while add_paths_to_append_rel() is called for all partitioned
table RTEs in a given partition tree, expand_inherited_rtentry() would
have set up a PartitionedChildRelInfo only for the root parent, so
get_partitioned_child_rels() would not find the same for non-root
partitioned table rels and crash failing the Assert. The changes I made
are such that we call get_partitioned_child_rels() only for the parent
rels that are known to correspond root partitioned tables (or as you
pointed out on the thread, "the table named in the query" as opposed those
added to the query as result of inheritance expansion). In addition to
the relkind check on the input RTE, it would seem that checking that the
reloptkind is RELOPT_BASEREL would be enough. But actually not, because
if a partitioned table is accessed in a UNION ALL query, reloptkind even
for the root partitioned table (the table named in the query) would be
RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
actually the root partitioned table is to check whether its parent rel is
not RTE_RELATION, because the parent in case of UNION ALL Append is a
RTE_SUBQUERY RT entry.

One thing I notice is that if I rip out the changes to initsplan.c,
the new regression test still passes. If it's possible to write a
test that fails without those changes, I think it would be a good idea
to include one in the patch. That's certainly one of the subtler
parts of this patch, IMHO.

Back when this (step-wise expansion of partition inheritance) used to be a
patch in the original declarative partitioning patch series, Ashutosh had
reported a test query [1]/messages/by-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj=iqg@mail.gmail.com that would fail getting a plan, for which we
came up with the initsplan.c changes in this patch as the solution:

ERROR: could not devise a query plan for the given query

I tried that query again without the initsplan.c changes and somehow the
same error does not occur anymore. It's strange because without the
initsplan.c changes, there is no way for partitions lower in the tree than
the first level to get the direct_lateral_relids and lateral_relids from
the root parent rel. Maybe, Ashutosh has a way to devise the failing
query again.

I also confirmed that the partition-pruning patch set works fine with this
patch instead of the patch on that thread with the same functionality,
which I will now drop from that patch set. Sorry about the wasted time.

Thanks,
Amit

[1]: /messages/by-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj=iqg@mail.gmail.com
/messages/by-id/CAFjFpReZF34MDbY95xoATi0xVj2mAry4-LHBWVBayOc8gj=iqg@mail.gmail.com

Attachments:

expand-stepwise-rmh-2.patchtext/plain; charset=UTF-8; name=expand-stepwise-rmh-2.patch
#204Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#203)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/08 14:47, Amit Langote wrote:

When I tried the attached patch, it doesn't seem to expand partitioning
inheritance in step-wise manner as the patch's title says.

Oops. By "attached patch", I had meant to say the Robert's patch, that
is, expand-stepwise-rmh.patch. Not expand-stepwise-rmh-2.patch, which is
the updated version of the patch attached with the quoted message.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#205Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#202)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 8, 2017 at 12:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

accumulate_append_subpath() is executed for every path instead of
every relation, so changing it would collect the same list multiple
times. Instead, I found the old way of associating all intermediate
partitions with the root partitioned relation work better. Here's the
updated patch set.

When I tried out patch 0001, it crashed repeatedly during 'make check'
because of an assertion failure in get_partitioned_child_rels.

Running "make check" on the whole patchset doesn't give that failure.
So I didn't notice the crash since I was running regression on the
whole patchset. Sorry for that. Fortunately git rebase -i allows us to
execute shell commands while applying patches, so I have set it up to
compile each patch and run regression. Hopefully that will catch such
errors in future. That process showed me that patch
0003-In-add_paths_to_append_rel-get-partitioned_rels-for-.patch fixes
that crash by calling get_partitioned_child_rels() only on the root
partitioned table for which we have set up child_rels list. Amit
Langote has a similar fix reported in his reply. So, we will discuss
it there.

It
seemed to me that the way the patch was refactoring
expand_inherited_rtentry involved more code rearrangement than
necessary, so I reverted all the code rearrangement and just kept the
functional changes, and all the crashes went away. (That refactoring
also failed to initialize has_child properly.) In so doing, I
reintroduced the problem that the PartitionedChildRelInfo lists
weren't getting set up correctly, but after some thought I realized
that was just because expand_single_inheritance_child() was choosing
between adding an RTE and adding the OID to partitioned_child_rels,
whereas for an intermediate partitioned table it needs to do both. So
I inserted a trivial fix for that problem (replacing "else" with a new
"if"-test), basically:

-    else
+
+    if (childrte->relkind == RELKIND_PARTITIONED_TABLE)

Please check out the attached version of the patch. In addition to
the above simplifications, I did some adjustments to the comments in
various places - some just grammar and others a bit more substantive.
And I think I broke a long line in one place, too.

One thing I notice is that if I rip out the changes to initsplan.c,
the new regression test still passes. If it's possible to write a
test that fails without those changes, I think it would be a good idea
to include one in the patch. That's certainly one of the subtler
parts of this patch, IMHO.

Amit Langote has replied on these points as well. So, I will comment
in a reply to his reply.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#206Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#203)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 8, 2017 at 1:47 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

When I tried the attached patch, it doesn't seem to expand partitioning
inheritance in step-wise manner as the patch's title says. I think the
rewritten patch forgot to include Ashutosh's changes to
expand_single_inheritance_child() whereby the AppendRelInfo of the child
will be marked with the direct parent instead of always the root parent.

Woops.

I updated the patch to include just those changes. I'm not sure about
one of the Ashutosh's changes whereby the child PlanRowMark is also passed
to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
the child RTE, child RT index and child Relation are fine, because they
are necessary for creating AppendRelInfos in a desired way for later
planning steps. But PlanRowMarks are not processed within the planner
afterwards and do not need to be marked with the immediate parent-child
association in the same way that AppendRelInfos need to be.

We probably need some better comments to explain which things need to
be marked using the immediate parent and which need to be marked using
the baserel, and why.

I also included the changes to add_paths_to_append_rel() from my patch on
the "path toward faster partition pruning" thread. We'd need that change,
because while add_paths_to_append_rel() is called for all partitioned
table RTEs in a given partition tree, expand_inherited_rtentry() would
have set up a PartitionedChildRelInfo only for the root parent, so
get_partitioned_child_rels() would not find the same for non-root
partitioned table rels and crash failing the Assert. The changes I made
are such that we call get_partitioned_child_rels() only for the parent
rels that are known to correspond root partitioned tables (or as you
pointed out on the thread, "the table named in the query" as opposed those
added to the query as result of inheritance expansion). In addition to
the relkind check on the input RTE, it would seem that checking that the
reloptkind is RELOPT_BASEREL would be enough. But actually not, because
if a partitioned table is accessed in a UNION ALL query, reloptkind even
for the root partitioned table (the table named in the query) would be
RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
actually the root partitioned table is to check whether its parent rel is
not RTE_RELATION, because the parent in case of UNION ALL Append is a
RTE_SUBQUERY RT entry.

OK, so this needs some good comments, too...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#207Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#203)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/08 4:04, Robert Haas wrote:

On Tue, Sep 5, 2017 at 7:01 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

accumulate_append_subpath() is executed for every path instead of
every relation, so changing it would collect the same list multiple
times. Instead, I found the old way of associating all intermediate
partitions with the root partitioned relation work better. Here's the
updated patch set.

When I tried out patch 0001, it crashed repeatedly during 'make check'
because of an assertion failure in get_partitioned_child_rels. It
seemed to me that the way the patch was refactoring
expand_inherited_rtentry involved more code rearrangement than
necessary, so I reverted all the code rearrangement and just kept the
functional changes, and all the crashes went away. (That refactoring
also failed to initialize has_child properly.)

When I tried the attached patch, it doesn't seem to expand partitioning
inheritance in step-wise manner as the patch's title says. I think the
rewritten patch forgot to include Ashutosh's changes to
expand_single_inheritance_child() whereby the AppendRelInfo of the child
will be marked with the direct parent instead of always the root parent.

Right. If we apply 0002 from partition-wise join patchset, which has
changed build_simple_rel() to collect direct children of a given
partitioned table, it introduces another crash because of Assertion
failure; for a partitioned table build_simple_rel() finds more
children than expected because indirect children are also counted as
direct children.

I updated the patch to include just those changes. I'm not sure about
one of the Ashutosh's changes whereby the child PlanRowMark is also passed
to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
the child RTE, child RT index and child Relation are fine, because they
are necessary for creating AppendRelInfos in a desired way for later
planning steps. But PlanRowMarks are not processed within the planner
afterwards and do not need to be marked with the immediate parent-child
association in the same way that AppendRelInfos need to be.

Passing top parent's row mark works today, since there is no
parent-child specific translation happening there. But if in future,
we introduce such a translation, row marks for indirect children in a
multi-level partitioned hierarchy won't be accurate. So, I think it's
better to pass row marks of the direct parent.

I also included the changes to add_paths_to_append_rel() from my patch on
the "path toward faster partition pruning" thread. We'd need that change,
because while add_paths_to_append_rel() is called for all partitioned
table RTEs in a given partition tree, expand_inherited_rtentry() would
have set up a PartitionedChildRelInfo only for the root parent, so
get_partitioned_child_rels() would not find the same for non-root
partitioned table rels and crash failing the Assert. The changes I made
are such that we call get_partitioned_child_rels() only for the parent
rels that are known to correspond root partitioned tables (or as you
pointed out on the thread, "the table named in the query" as opposed those
added to the query as result of inheritance expansion). In addition to
the relkind check on the input RTE, it would seem that checking that the
reloptkind is RELOPT_BASEREL would be enough. But actually not, because
if a partitioned table is accessed in a UNION ALL query, reloptkind even
for the root partitioned table (the table named in the query) would be
RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
actually the root partitioned table is to check whether its parent rel is
not RTE_RELATION, because the parent in case of UNION ALL Append is a
RTE_SUBQUERY RT entry.

There was a change in my 0003 patch, which fixed the crash. It checked
for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
my 0001 patch. It no more crashes. I tried various queries involving
set operations and bare multi-level partitioned table scan with my
patch, but none of them showed any anomaly. Do you have a testcase
which shows problem with my patch? May be your suggestion is correct,
but corresponding code implementation is slightly longer than I would
expect. So, we should go with it, if there is corresponding testcase
which shows why it's needed.

In your patch
+            parent_rel = root->simple_rel_array[parent_relid];
+            get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);
Do you mean RTE_RELATION as you explained above?

One thing I notice is that if I rip out the changes to initsplan.c,
the new regression test still passes. If it's possible to write a
test that fails without those changes, I think it would be a good idea
to include one in the patch. That's certainly one of the subtler
parts of this patch, IMHO.

Back when this (step-wise expansion of partition inheritance) used to be a
patch in the original declarative partitioning patch series, Ashutosh had
reported a test query [1] that would fail getting a plan, for which we
came up with the initsplan.c changes in this patch as the solution:

ERROR: could not devise a query plan for the given query

I tried that query again without the initsplan.c changes and somehow the
same error does not occur anymore. It's strange because without the
initsplan.c changes, there is no way for partitions lower in the tree than
the first level to get the direct_lateral_relids and lateral_relids from
the root parent rel. Maybe, Ashutosh has a way to devise the failing
query again.

Thanks a lot for the reference. I devised a testcase slightly
modifying my original test. I have included the test in the latest
patch set.

I have included Robert's changes to parts other than
expand_inherited_rtentry() in the patch.

I also confirmed that the partition-pruning patch set works fine with this
patch instead of the patch on that thread with the same functionality,
which I will now drop from that patch set. Sorry about the wasted time.

Thanks a lot. Please review the patch in the updated patchset.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v30.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v30.tar.gz
#208Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#207)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I also confirmed that the partition-pruning patch set works fine with this
patch instead of the patch on that thread with the same functionality,
which I will now drop from that patch set. Sorry about the wasted time.

Thanks a lot. Please review the patch in the updated patchset.

In set_append_rel_size(), I don't find the comment too clear (and this
part was taken from Amit's patch, right?). I suggest:

+    /*
+     * Associate the partitioned tables which are descendents of the table
+     * named in the query with the topmost append path (i.e. the one where
+     * rel->reloptkind is RELOPT_BASEREL).  This ensures that they get properly
+     * locked at execution time.
+     */

I'm a bit suspicious about the fact that there are now executor
changes related to the PlanRowMarks. If the rowmark's prti is now the
intermediate parent's RT index rather than the top-parent's RT index,
it'd seem like that'd matter somehow. Maybe it doesn't, because the
code that cares about prti seems to only care about whether it's
different from rti. But if that's true everywhere, then why even
change this? I think we might be well off not to tinker with things
that don't need to be changed.

Apart from that concern, now that I understand (from my own failed
attempt and some off-list discussion) why this patch works the way it
does, I think this is in fairly good shape.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#209Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#207)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/09 2:38, Ashutosh Bapat wrote:

On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:

I updated the patch to include just those changes. I'm not sure about
one of the Ashutosh's changes whereby the child PlanRowMark is also passed
to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
the child RTE, child RT index and child Relation are fine, because they
are necessary for creating AppendRelInfos in a desired way for later
planning steps. But PlanRowMarks are not processed within the planner
afterwards and do not need to be marked with the immediate parent-child
association in the same way that AppendRelInfos need to be.

Passing top parent's row mark works today, since there is no
parent-child specific translation happening there. But if in future,
we introduce such a translation, row marks for indirect children in a
multi-level partitioned hierarchy won't be accurate. So, I think it's
better to pass row marks of the direct parent.

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something. We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

I also included the changes to add_paths_to_append_rel() from my patch on
the "path toward faster partition pruning" thread. We'd need that change,
because while add_paths_to_append_rel() is called for all partitioned
table RTEs in a given partition tree, expand_inherited_rtentry() would
have set up a PartitionedChildRelInfo only for the root parent, so
get_partitioned_child_rels() would not find the same for non-root
partitioned table rels and crash failing the Assert. The changes I made
are such that we call get_partitioned_child_rels() only for the parent
rels that are known to correspond root partitioned tables (or as you
pointed out on the thread, "the table named in the query" as opposed those
added to the query as result of inheritance expansion). In addition to
the relkind check on the input RTE, it would seem that checking that the
reloptkind is RELOPT_BASEREL would be enough. But actually not, because
if a partitioned table is accessed in a UNION ALL query, reloptkind even
for the root partitioned table (the table named in the query) would be
RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
actually the root partitioned table is to check whether its parent rel is
not RTE_RELATION, because the parent in case of UNION ALL Append is a
RTE_SUBQUERY RT entry.

There was a change in my 0003 patch, which fixed the crash. It checked
for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
my 0001 patch. It no more crashes. I tried various queries involving
set operations and bare multi-level partitioned table scan with my
patch, but none of them showed any anomaly. Do you have a testcase
which shows problem with my patch? May be your suggestion is correct,
but corresponding code implementation is slightly longer than I would
expect. So, we should go with it, if there is corresponding testcase
which shows why it's needed.

If we go with your patch, partitioned tables won't get locked, for
example, in case of the following query (p is a partitioned table):

select 1 from p union all select 2 from p;

That's because the RelOptInfos for the two instances of p in the above
query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
of the Append corresponding to the UNION ALL subquery RTE. So,
partitioned_rels does not get set per your proposed code.

In your patch

+            parent_rel = root->simple_rel_array[parent_relid];
+            get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);

Do you mean RTE_RELATION as you explained above?

No, I mean RTE_SUBQUERY.

If the partitioned table RTE in question corresponds to one named in the
query, we should be able to find its pcinfo in root->pcinfo_list. If the
partitioned table RTE is one added as result of inheritance expansion, it
won't have an associated pcinfo. So, we should find a way to distinguish
them from one another. The first idea that had occurred to me was the
same as yours -- RelOptInfo of the partitioned table RTE named in the
query would be of reloptkind RELOPT_BASEREL and those of the partitioned
table RTE added as result of inheritance expansion will be of reloptkind
RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former
is not. If the partitioned table named in the query appears under UNION
ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That
means we have to use some other means to distinguish partitioned table
RTEs that have an associated pcinfo from those that don't. So, I devised
this method of looking at the parent RTE (if any) for distinguishing the
two. Partitioned table named in the query either doesn't have the parent
or if it does, the parent could only ever be a UNION ALL subquery
(RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion
will always have the parent and the parent will only ever be a table
(RTE_RELATION).

One thing I notice is that if I rip out the changes to initsplan.c,
the new regression test still passes. If it's possible to write a
test that fails without those changes, I think it would be a good idea
to include one in the patch. That's certainly one of the subtler
parts of this patch, IMHO.

Back when this (step-wise expansion of partition inheritance) used to be a
patch in the original declarative partitioning patch series, Ashutosh had
reported a test query [1] that would fail getting a plan, for which we
came up with the initsplan.c changes in this patch as the solution:

ERROR: could not devise a query plan for the given query

I tried that query again without the initsplan.c changes and somehow the
same error does not occur anymore. It's strange because without the
initsplan.c changes, there is no way for partitions lower in the tree than
the first level to get the direct_lateral_relids and lateral_relids from
the root parent rel. Maybe, Ashutosh has a way to devise the failing
query again.

Thanks a lot for the reference. I devised a testcase slightly
modifying my original test. I have included the test in the latest
patch set.

I have included Robert's changes to parts other than
expand_inherited_rtentry() in the patch.

I also confirmed that the partition-pruning patch set works fine with this
patch instead of the patch on that thread with the same functionality,
which I will now drop from that patch set. Sorry about the wasted time.

Thanks a lot. Please review the patch in the updated patchset.

Some comments:

In create_lateral_join_info():

+ Assert(IS_SIMPLE_REL(brel));
+ Assert(brte);

The second Assert is either unnecessary or should be placed first.

The following comment could be made a bit clearer.

+         * In the case of table inheritance, the parent RTE is directly
linked
+         * to every child table via an AppendRelInfo.  In the case of table
+         * partitioning, the inheritance hierarchy is expanded one level at a
+         * time rather than flattened.  Therefore, an other member rel
that is
+         * a partitioned table may have children of its own, and must
+         * therefore be marked with the appropriate lateral info so that
those
+         * children eventually get marked also.

How about: In the case of partitioned table inheritance, the original
parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
Partitions below the first level are accessible only via their immediate
parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
consider those as well.

In expand_inherited_rtentry(), the following comment fragment is obsolete,
because we *do* now create AppendRelInfo's for partitioned children:

+        /*
+         * We keep a list of objects in root, each of which maps a
partitioned
+         * parent RT index to the list of RT indexes of its partitioned child
+         * tables which do not have AppendRelInfos associated with those.

By the way, when we call expand_single_inheritance_child() in the
non-partitioned inheritance case, we should pass NULL for childrte_p,
childRTindex_p, childrc_p, instead of declaring variables that won't be
used. Hence, expand_single_inheritance_child() should make those
arguments optional.

+
+    /*
+     * If the partitioned table has no partitions or all the partitions are
+     * temporary tables from other backends, treat this as non-inheritance
+     * case.
+     */
+    if (!has_child)
+        parentrte->inh = false;

I guess the above applies to all partitioned tables in the tree, so, I
think we should update the comment in set_rel_size():

else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
{
/*
* A partitioned table without leaf partitions is marked
* as a dummy rel.
*/
set_dummy_rel_pathlist(rel);
}

to say: a partitioned table without partitions is marked as a dummy rel.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#210Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#208)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/09 9:58, Robert Haas wrote:

I'm a bit suspicious about the fact that there are now executor
changes related to the PlanRowMarks. If the rowmark's prti is now the
intermediate parent's RT index rather than the top-parent's RT index,
it'd seem like that'd matter somehow. Maybe it doesn't, because the
code that cares about prti seems to only care about whether it's
different from rti.

Yes, it doesn't matter. The important point though is that nothing we
want to do in the short term requires us to set a child PlanRowMark's prti
to its immediate parent's RT index, as I also mentioned in reply to Ashutosh.

But if that's true everywhere, then why even
change this? I think we might be well off not to tinker with things
that don't need to be changed.

+1.

Apart from that concern, now that I understand (from my own failed
attempt and some off-list discussion) why this patch works the way it
does, I think this is in fairly good shape.

I too think so, except we still need to incorporate changes to
add_paths_to_append_rel() necessary to correctly set partitioned_rels, as
I explained in reply Ashutosh.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#211Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#208)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 8, 2017 at 1:38 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I also confirmed that the partition-pruning patch set works fine with this
patch instead of the patch on that thread with the same functionality,
which I will now drop from that patch set. Sorry about the wasted time.

Thanks a lot. Please review the patch in the updated patchset.

In set_append_rel_size(), I don't find the comment too clear (and this
part was taken from Amit's patch, right?). I suggest:

No, I didn't take it from Amit's patch. Both of us have different
wordings. But yours is better than both of us. Included it in the
attached patches.

+    /*
+     * Associate the partitioned tables which are descendents of the table
+     * named in the query with the topmost append path (i.e. the one where
+     * rel->reloptkind is RELOPT_BASEREL).  This ensures that they get properly
+     * locked at execution time.
+     */

I'm a bit suspicious about the fact that there are now executor
changes related to the PlanRowMarks. If the rowmark's prti is now the
intermediate parent's RT index rather than the top-parent's RT index,
it'd seem like that'd matter somehow. Maybe it doesn't, because the
code that cares about prti seems to only care about whether it's
different from rti. But if that's true everywhere, then why even
change this? I think we might be well off not to tinker with things
that don't need to be changed.

In the definition of ExecRowMark, I see
Index prti; /* parent range table index, if child */
It just says parent, by which I take as direct parent. For
inheritance, which earlier flattened inheritance hierarchy, direct
parent was top parent. So, probably nobody thought whether a parent is
direct parent or top parent. But now that we have introduced that
concept we need to interpret this comment anew. And I think
interpreting it as direct parent is non-lossy. If we set top parent's
index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
So, it looks quite natural that we set the direct parent's index in
PlanRowMark. From that POV, we aren't changing anything, we are
setting the same parent RTI in AppendRelInfo and PlanRowMark. Chaning
different parent RTIs in those two structure would be a real change.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#212Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#211)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/11 16:23, Ashutosh Bapat wrote:

On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm a bit suspicious about the fact that there are now executor
changes related to the PlanRowMarks. If the rowmark's prti is now the
intermediate parent's RT index rather than the top-parent's RT index,
it'd seem like that'd matter somehow. Maybe it doesn't, because the
code that cares about prti seems to only care about whether it's
different from rti. But if that's true everywhere, then why even
change this? I think we might be well off not to tinker with things
that don't need to be changed.

In the definition of ExecRowMark, I see
Index prti; /* parent range table index, if child */
It just says parent, by which I take as direct parent. For
inheritance, which earlier flattened inheritance hierarchy, direct
parent was top parent. So, probably nobody thought whether a parent is
direct parent or top parent. But now that we have introduced that
concept we need to interpret this comment anew. And I think
interpreting it as direct parent is non-lossy.

But then we also don't have anything to say about why we're making that
change. If you could describe what non-lossy is in this context, then
fine. But that we'd like to match with what we're going to do for
AppendRelInfos does not seem to be a sufficient explanation for this change.

If we set top parent's
index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
So, it looks quite natural that we set the direct parent's index in
PlanRowMark.

They would not agree, yes, but aren't they unrelated? If we have a reason
for them to agree, (for example, row-locking breaks in the inherited table
case if we didn't), then we should definitely make them agree.

Updating the comment for prti definition might be something that this
patch could (should?) do, but I'm not quite sure about that too.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#213Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#209)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/09 2:38, Ashutosh Bapat wrote:

On Fri, Sep 8, 2017 at 11:17 AM, Amit Langote wrote:

I updated the patch to include just those changes. I'm not sure about
one of the Ashutosh's changes whereby the child PlanRowMark is also passed
to expand_partitioned_rtentry() to use as the parent PlanRowMark. I think
the child RTE, child RT index and child Relation are fine, because they
are necessary for creating AppendRelInfos in a desired way for later
planning steps. But PlanRowMarks are not processed within the planner
afterwards and do not need to be marked with the immediate parent-child
association in the same way that AppendRelInfos need to be.

Passing top parent's row mark works today, since there is no
parent-child specific translation happening there. But if in future,
we introduce such a translation, row marks for indirect children in a
multi-level partitioned hierarchy won't be accurate. So, I think it's
better to pass row marks of the direct parent.

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something. We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

I also included the changes to add_paths_to_append_rel() from my patch on
the "path toward faster partition pruning" thread. We'd need that change,
because while add_paths_to_append_rel() is called for all partitioned
table RTEs in a given partition tree, expand_inherited_rtentry() would
have set up a PartitionedChildRelInfo only for the root parent, so
get_partitioned_child_rels() would not find the same for non-root
partitioned table rels and crash failing the Assert. The changes I made
are such that we call get_partitioned_child_rels() only for the parent
rels that are known to correspond root partitioned tables (or as you
pointed out on the thread, "the table named in the query" as opposed those
added to the query as result of inheritance expansion). In addition to
the relkind check on the input RTE, it would seem that checking that the
reloptkind is RELOPT_BASEREL would be enough. But actually not, because
if a partitioned table is accessed in a UNION ALL query, reloptkind even
for the root partitioned table (the table named in the query) would be
RELOPT_OTHER_MEMBER_REL. The only way to confirm that the input rel is
actually the root partitioned table is to check whether its parent rel is
not RTE_RELATION, because the parent in case of UNION ALL Append is a
RTE_SUBQUERY RT entry.

There was a change in my 0003 patch, which fixed the crash. It checked
for RELOPT_BASEREL and RELKIND_PARTITIONED_TABLE. I have pulled it in
my 0001 patch. It no more crashes. I tried various queries involving
set operations and bare multi-level partitioned table scan with my
patch, but none of them showed any anomaly. Do you have a testcase
which shows problem with my patch? May be your suggestion is correct,
but corresponding code implementation is slightly longer than I would
expect. So, we should go with it, if there is corresponding testcase
which shows why it's needed.

If we go with your patch, partitioned tables won't get locked, for
example, in case of the following query (p is a partitioned table):

select 1 from p union all select 2 from p;

That's because the RelOptInfos for the two instances of p in the above
query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
of the Append corresponding to the UNION ALL subquery RTE. So,
partitioned_rels does not get set per your proposed code.

Session 1:
postgres=# begin;
BEGIN
postgres=# select 1 from t1 union all select 2 from t1;
?column?
----------
(0 rows)

postgres=# select pg_backend_pid();
pg_backend_pid
----------------
28843
(1 row)

Session 2
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
locktype | relation | virtualxid | virtualtransaction | pid |
mode | granted | fastpath
------------+----------+------------+--------------------+-------+-----------------+---------+----------
relation | pg_locks | | 4/14 | 28854 |
AccessShareLock | t | t
virtualxid | | 4/14 | 4/14 | 28854 |
ExclusiveLock | t | t
relation | t1p1p1 | | 3/9 | 28843 |
AccessShareLock | t | t
relation | t1p1 | | 3/9 | 28843 |
AccessShareLock | t | t
relation | t1 | | 3/9 | 28843 |
AccessShareLock | t | t
virtualxid | | 3/9 | 3/9 | 28843 |
ExclusiveLock | t | t
(6 rows)

So, all partitioned partitions are getting locked correctly. Am I
missing something?

In your patch

+            parent_rel = root->simple_rel_array[parent_relid];
+            get_pcinfo = (parent_rel->rtekind == RTE_SUBQUERY);

Do you mean RTE_RELATION as you explained above?

No, I mean RTE_SUBQUERY.

If the partitioned table RTE in question corresponds to one named in the
query, we should be able to find its pcinfo in root->pcinfo_list. If the
partitioned table RTE is one added as result of inheritance expansion, it
won't have an associated pcinfo. So, we should find a way to distinguish
them from one another. The first idea that had occurred to me was the
same as yours -- RelOptInfo of the partitioned table RTE named in the
query would be of reloptkind RELOPT_BASEREL and those of the partitioned
table RTE added as result of inheritance expansion will be of reloptkind
RELOPT_OTHER_MEMBER_REL. Although the latter is always true, the former
is not. If the partitioned table named in the query appears under UNION
ALL query, then its reloptkind will be RELOPT_OTHER_MEMBER_REL. That
means we have to use some other means to distinguish partitioned table
RTEs that have an associated pcinfo from those that don't. So, I devised
this method of looking at the parent RTE (if any) for distinguishing the
two. Partitioned table named in the query either doesn't have the parent
or if it does, the parent could only ever be a UNION ALL subquery
(RTE_SUBQUERY). Partitioned tables added as part of inheritance expansion
will always have the parent and the parent will only ever be a table
(RTE_RELATION).

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?

In create_lateral_join_info():

+ Assert(IS_SIMPLE_REL(brel));
+ Assert(brte);

The second Assert is either unnecessary or should be placed first.

simple_rte_array[] may have some NULL entries. Second assert makes
sure that we aren't dealing with a NULL entry. Any particular reason
to reorder the asserts?

The following comment could be made a bit clearer.

+         * In the case of table inheritance, the parent RTE is directly
linked
+         * to every child table via an AppendRelInfo.  In the case of table
+         * partitioning, the inheritance hierarchy is expanded one level at a
+         * time rather than flattened.  Therefore, an other member rel
that is
+         * a partitioned table may have children of its own, and must
+         * therefore be marked with the appropriate lateral info so that
those
+         * children eventually get marked also.

How about: In the case of partitioned table inheritance, the original
parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
Partitions below the first level are accessible only via their immediate
parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
consider those as well.

I don't see much difference between those two. We usually do not use
macros in comments, so usually comments mention "other member" rel.
Let's leave this for the committer to judge.

In expand_inherited_rtentry(), the following comment fragment is obsolete,
because we *do* now create AppendRelInfo's for partitioned children:

+        /*
+         * We keep a list of objects in root, each of which maps a
partitioned
+         * parent RT index to the list of RT indexes of its partitioned child
+         * tables which do not have AppendRelInfos associated with those.

Good catch. I have reworded it as
/*
* We keep a list of objects in root, each of which maps a root
* partitioned parent RT index to the list of RT indexes of descendant
* partitioned child tables.

Does that look good?

By the way, when we call expand_single_inheritance_child() in the
non-partitioned inheritance case, we should pass NULL for childrte_p,
childRTindex_p, childrc_p, instead of declaring variables that won't be
used. Hence, expand_single_inheritance_child() should make those
arguments optional.

That introduces an extra "if" condition, which is costlier than an
assignment. We have used both the styles in the code. Previously, I
have got comments otherwise. So, I am not sure.

+
+    /*
+     * If the partitioned table has no partitions or all the partitions are
+     * temporary tables from other backends, treat this as non-inheritance
+     * case.
+     */
+    if (!has_child)
+        parentrte->inh = false;

I guess the above applies to all partitioned tables in the tree, so, I
think we should update the comment in set_rel_size():

else if (rte->relkind == RELKIND_PARTITIONED_TABLE)
{
/*
* A partitioned table without leaf partitions is marked
* as a dummy rel.
*/
set_dummy_rel_pathlist(rel);
}

to say: a partitioned table without partitions is marked as a dummy rel.

Done. Thanks again for the catch.

I will update the patches once we have some resolution about 1. prti
in PlanRowMarks and 2. detection of root partitioned table in
add_paths_to_append_rel().

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#214Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#212)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 11, 2017 at 2:16 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/11 16:23, Ashutosh Bapat wrote:

On Sat, Sep 9, 2017 at 6:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm a bit suspicious about the fact that there are now executor
changes related to the PlanRowMarks. If the rowmark's prti is now the
intermediate parent's RT index rather than the top-parent's RT index,
it'd seem like that'd matter somehow. Maybe it doesn't, because the
code that cares about prti seems to only care about whether it's
different from rti. But if that's true everywhere, then why even
change this? I think we might be well off not to tinker with things
that don't need to be changed.

In the definition of ExecRowMark, I see
Index prti; /* parent range table index, if child */
It just says parent, by which I take as direct parent. For
inheritance, which earlier flattened inheritance hierarchy, direct
parent was top parent. So, probably nobody thought whether a parent is
direct parent or top parent. But now that we have introduced that
concept we need to interpret this comment anew. And I think
interpreting it as direct parent is non-lossy.

But then we also don't have anything to say about why we're making that
change. If you could describe what non-lossy is in this context, then
fine.

By setting prti to the topmost parent RTI we are loosing information
that this child may be an intermediate child similar to what we did
earlier to AppendRelInfo. That's the lossy-ness in this context.

But that we'd like to match with what we're going to do for
AppendRelInfos does not seem to be a sufficient explanation for this change.

The purpose of this patch is to change the parent-child linkages for
partitioned table and prti is one of them. So, in fact, I am wondering
why not to change that along with AppendRelInfo.

If we set top parent's
index, parent RTI in AppendRelInfo and PlanRowMark would not agree.
So, it looks quite natural that we set the direct parent's index in
PlanRowMark.

They would not agree, yes, but aren't they unrelated? If we have a reason
for them to agree, (for example, row-locking breaks in the inherited table
case if we didn't), then we should definitely make them agree.

Updating the comment for prti definition might be something that this
patch could (should?) do, but I'm not quite sure about that too.

To me that looks backwards again for the reasons described above.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#215Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#213)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

So, all partitioned partitions are getting locked correctly. Am I
missing something?

That's not a valid test. In that scenario, you're going to hold all
the locks acquired by the planner, all the locks acquired by the
rewriter, and all the locks acquired by the executor, but when using
prepared queries, it's possible to execute the plan after the planner
and rewriter locks are no longer held.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#216Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#215)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

So, all partitioned partitions are getting locked correctly. Am I
missing something?

That's not a valid test. In that scenario, you're going to hold all
the locks acquired by the planner, all the locks acquired by the
rewriter, and all the locks acquired by the executor, but when using
prepared queries, it's possible to execute the plan after the planner
and rewriter locks are no longer held.

I see the same thing when I use prepare and execute

Session 1
postgres=# prepare stmt as select 1 from t1 union all select 2 from t1;
PREPARE
postgres=# select pg_backend_pid();
pg_backend_pid
----------------
50912
(1 row)

postgres=# begin;
BEGIN
postgres=# execute stmt;
?column?
----------
(0 rows)

Session 2
postgres=# select locktype, relation::regclass, virtualxid,
virtualtransaction, pid, mode, granted, fastpath from pg_locks;
locktype | relation | virtualxid | virtualtransaction | pid |
mode | granted | fastpath
------------+----------+------------+--------------------+-------+-----------------+---------+----------
relation | pg_locks | | 4/4 | 50914 |
AccessShareLock | t | t
virtualxid | | 4/4 | 4/4 | 50914 |
ExclusiveLock | t | t
relation | t1p1p1 | | 3/12 | 50912 |
AccessShareLock | t | t
relation | t1p1 | | 3/12 | 50912 |
AccessShareLock | t | t
relation | t1 | | 3/12 | 50912 |
AccessShareLock | t | t
virtualxid | | 3/12 | 3/12 | 50912 |
ExclusiveLock | t | t
(6 rows)

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#217Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#216)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 11, 2017 at 8:07 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I see the same thing when I use prepare and execute

Hmm. Well, that's good, but it doesn't prove there's no bug. We have
to understand where and why it's getting locked to know whether the
behavior will be correct in all cases. I haven't had time to look at
Amit's comments in detail yet so I don't know whether I agree with his
analysis or not, but we have to look at what's going on under the hood
to know whether the engine is working -- not just listen to the noise
it makes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#218Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#213)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/11 19:45, Ashutosh Bapat wrote:

On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something. We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

You patch does 2 to be able to do 1, doesn't it? That is, to be able to
set the child PlanRowMark's prti to the direct parent's RT index, you pass
the immediate parent's PlanRowMark to the recursive call of
expand_single_inheritance_child().

All I am trying to say is that this patch's mission is to expand
inheritance step-wise to be able to do certain things in the *planner*
that weren't possible before. The patch accomplishes that by creating
child AppendRelInfos such that its parent_relid field is set to the
immediate parent's RT index. It's quite clear why we're doing so. It's
not clear why we should do so for PlanRowMarks too. Maybe it's fine as
long as nothing breaks.

If we go with your patch, partitioned tables won't get locked, for
example, in case of the following query (p is a partitioned table):

select 1 from p union all select 2 from p;

That's because the RelOptInfos for the two instances of p in the above
query are RELOPT_OTHER_MEMBER_REL, not RELOPT_BASEREL. They are children
of the Append corresponding to the UNION ALL subquery RTE. So,
partitioned_rels does not get set per your proposed code.

[...]

So, all partitioned partitions are getting locked correctly. Am I
missing something?

Will reply to this separately to your other email.

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?

Yeah, I guess that could work. We'll just have to write comments to
describe why the Assert is written that way.

In create_lateral_join_info():

+ Assert(IS_SIMPLE_REL(brel));
+ Assert(brte);

The second Assert is either unnecessary or should be placed first.

simple_rte_array[] may have some NULL entries. Second assert makes
sure that we aren't dealing with a NULL entry. Any particular reason
to reorder the asserts?

Sorry, I missed that the 2nd Assert has b"rte". I thought it's b"rel".

The following comment could be made a bit clearer.

+         * In the case of table inheritance, the parent RTE is directly
linked
+         * to every child table via an AppendRelInfo.  In the case of table
+         * partitioning, the inheritance hierarchy is expanded one level at a
+         * time rather than flattened.  Therefore, an other member rel
that is
+         * a partitioned table may have children of its own, and must
+         * therefore be marked with the appropriate lateral info so that
those
+         * children eventually get marked also.

How about: In the case of partitioned table inheritance, the original
parent RTE is linked, via AppendRelInfo, only to its immediate partitions.
Partitions below the first level are accessible only via their immediate
parent's RelOptInfo, which would be of kind RELOPT_OTHER_MEMBER_REL, so
consider those as well.

I don't see much difference between those two. We usually do not use
macros in comments, so usually comments mention "other member" rel.
Let's leave this for the committer to judge.

Sure.

In expand_inherited_rtentry(), the following comment fragment is obsolete,
because we *do* now create AppendRelInfo's for partitioned children:

+        /*
+         * We keep a list of objects in root, each of which maps a
partitioned
+         * parent RT index to the list of RT indexes of its partitioned child
+         * tables which do not have AppendRelInfos associated with those.

Good catch. I have reworded it as
/*
* We keep a list of objects in root, each of which maps a root
* partitioned parent RT index to the list of RT indexes of descendant
* partitioned child tables.

Does that look good?

Looks fine.

By the way, when we call expand_single_inheritance_child() in the
non-partitioned inheritance case, we should pass NULL for childrte_p,
childRTindex_p, childrc_p, instead of declaring variables that won't be
used. Hence, expand_single_inheritance_child() should make those
arguments optional.

That introduces an extra "if" condition, which is costlier than an
assignment. We have used both the styles in the code. Previously, I
have got comments otherwise. So, I am not sure.

OK. expand_single_inheritance_child's header comment does not mention the
new result fields. Maybe add a comment describing what their role is and
that they're not optional arguments.

I will update the patches once we have some resolution about 1. prti
in PlanRowMarks and 2. detection of root partitioned table in
add_paths_to_append_rel().

OK.

About 2, I somewhat agree with your proposed solution above, which might
be simpler to explain in comments than the code I proposed.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#219Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#218)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/11 19:45, Ashutosh Bapat wrote:

On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something. We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

You patch does 2 to be able to do 1, doesn't it? That is, to be able to
set the child PlanRowMark's prti to the direct parent's RT index, you pass
the immediate parent's PlanRowMark to the recursive call of
expand_single_inheritance_child().

No. child PlanRowMark's prti is set to parentRTIndex, which is a
separate argument and is used to also set parent_relid in
AppendRelInfo.

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?

Yeah, I guess that could work. We'll just have to write comments to
describe why the Assert is written that way.

By the way, when we call expand_single_inheritance_child() in the
non-partitioned inheritance case, we should pass NULL for childrte_p,
childRTindex_p, childrc_p, instead of declaring variables that won't be
used. Hence, expand_single_inheritance_child() should make those
arguments optional.

That introduces an extra "if" condition, which is costlier than an
assignment. We have used both the styles in the code. Previously, I
have got comments otherwise. So, I am not sure.

OK. expand_single_inheritance_child's header comment does not mention the
new result fields. Maybe add a comment describing what their role is and
that they're not optional arguments.

I will update the patches once we have some resolution about 1. prti
in PlanRowMarks and 2. detection of root partitioned table in
add_paths_to_append_rel().

OK.

About 2, I somewhat agree with your proposed solution above, which might
be simpler to explain in comments than the code I proposed.

After testing a few queries I am getting a feeling that
ExecLockNonLeafAppendTables isn't really locking anything. I will
write more about that in my reply to Robert's mail.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#220Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#216)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/11 21:07, Ashutosh Bapat wrote:

On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

So, all partitioned partitions are getting locked correctly. Am I
missing something?

That's not a valid test. In that scenario, you're going to hold all
the locks acquired by the planner, all the locks acquired by the
rewriter, and all the locks acquired by the executor, but when using
prepared queries, it's possible to execute the plan after the planner
and rewriter locks are no longer held.

I see the same thing when I use prepare and execute

So I looked at this a bit closely and came to the conclusion that we may
not need to keep partitioned table RT indexes in the
(Merge)Append.partitioned_rels after all, as far as execution-time locking
is concerned.

Consider two cases:

1. Plan is created and executed in the same transaction

In this case, locks taken on the partitioned tables by the planner will
suffice.

2. Plan is executed in a different transaction from the one in which it
was created (a cached plan)

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Moreover, because all the tables from plannedstmt->rtable, including the
partitioned tables, will be added to PlannedStmt.relationsOids, any
invalidation events affecting the partitioned tables (for example,
add/remove a partition) will cause the plan involving partitioned tables
to be recreated.

In none of this do we rely on the partitioned table RT indexes appearing
in the (Merge)Append node itself. Maybe, we should just remove
partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
separate patch and move on.

Thoughts?

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#221Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#220)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/11 21:07, Ashutosh Bapat wrote:

On Mon, Sep 11, 2017 at 5:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 11, 2017 at 6:45 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

So, all partitioned partitions are getting locked correctly. Am I
missing something?

That's not a valid test. In that scenario, you're going to hold all
the locks acquired by the planner, all the locks acquired by the
rewriter, and all the locks acquired by the executor, but when using
prepared queries, it's possible to execute the plan after the planner
and rewriter locks are no longer held.

I see the same thing when I use prepare and execute

So I looked at this a bit closely and came to the conclusion that we may
not need to keep partitioned table RT indexes in the
(Merge)Append.partitioned_rels after all, as far as execution-time locking
is concerned.

Consider two cases:

1. Plan is created and executed in the same transaction

In this case, locks taken on the partitioned tables by the planner will
suffice.

2. Plan is executed in a different transaction from the one in which it
was created (a cached plan)

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Moreover, because all the tables from plannedstmt->rtable, including the
partitioned tables, will be added to PlannedStmt.relationsOids, any
invalidation events affecting the partitioned tables (for example,
add/remove a partition) will cause the plan involving partitioned tables
to be recreated.

In none of this do we rely on the partitioned table RT indexes appearing
in the (Merge)Append node itself. Maybe, we should just remove
partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
separate patch and move on.

Thoughts?

Yes, I did the same analysis (to which I refer in my earlier reply to
you). I too think we should just remove partitioned_rels from Append
paths. But then the question is those are then transferred to
ModifyTable node in create_modifytable_plan() and use it for something
else. What should we do about that code? I don't think we are really
using that list from ModifyTable node as well, so may be we could
remove it from there as well. What do you think? Does that mean
partitioned_rels isn't used at all in the code?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#222Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#221)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/12 16:55, Ashutosh Bapat wrote:

On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:

So I looked at this a bit closely and came to the conclusion that we may
not need to keep partitioned table RT indexes in the
(Merge)Append.partitioned_rels after all, as far as execution-time locking
is concerned.

Consider two cases:

1. Plan is created and executed in the same transaction

In this case, locks taken on the partitioned tables by the planner will
suffice.

2. Plan is executed in a different transaction from the one in which it
was created (a cached plan)

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Moreover, because all the tables from plannedstmt->rtable, including the
partitioned tables, will be added to PlannedStmt.relationsOids, any
invalidation events affecting the partitioned tables (for example,
add/remove a partition) will cause the plan involving partitioned tables
to be recreated.

In none of this do we rely on the partitioned table RT indexes appearing
in the (Merge)Append node itself. Maybe, we should just remove
partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
separate patch and move on.

Thoughts?

Yes, I did the same analysis (to which I refer in my earlier reply to
you). I too think we should just remove partitioned_rels from Append
paths. But then the question is those are then transferred to
ModifyTable node in create_modifytable_plan() and use it for something
else. What should we do about that code? I don't think we are really
using that list from ModifyTable node as well, so may be we could
remove it from there as well. What do you think? Does that mean
partitioned_rels isn't used at all in the code?

No, we cannot simply get rid of partitioned_rels altogether. We'll need
to keep it in the ModifyTable node, because we *do* need the
nonleafResultRelations list in PlannedStmt to distinguish partitioned
table result relations, which set_plan_refs builds by concatenating
partitioned_rels lists of various ModifyTable nodes of the query. The
PlannedStmt.nonleafResultRelations list actually has some use (which
parallels PlannedStmt.resultRelations), but partitioned_rels list in the
individual (Merge)Append, as it turns out, doesn't.

So, we can remove partitioned_rels from (Merge)AppendPath and
(Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#223Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#219)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/12 16:39, Ashutosh Bapat wrote:

On Tue, Sep 12, 2017 at 7:31 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/11 19:45, Ashutosh Bapat wrote:

On Mon, Sep 11, 2017 at 12:16 PM, Amit Langote wrote:

IMHO, we should make it the responsibility of the future patch to set a
child PlanRowMark's prti to the direct parent's RT index, when we actually
know that it's needed for something. We clearly know today why we need to
pass the other objects like child RT entry, RT index, and Relation, so we
should limit this patch to pass only those objects to the recursive call.
That makes this patch a relatively easy to understand change.

I think you are mixing two issues here 1. setting parent RTI in child
PlanRowMark and 2. passing immediate parent's PlanRowMark to
expand_single_inheritance_child().

I have discussed 1 in my reply to Robert.

About 2 you haven't given any particular comments to my reply. To me
it looks like it's this patch that introduces the notion of
multi-level expansion, so it's natural for this patch to pass
PlanRowMark in cascaded fashion similar to other structures.

You patch does 2 to be able to do 1, doesn't it? That is, to be able to
set the child PlanRowMark's prti to the direct parent's RT index, you pass
the immediate parent's PlanRowMark to the recursive call of
expand_single_inheritance_child().

No. child PlanRowMark's prti is set to parentRTIndex, which is a
separate argument and is used to also set parent_relid in
AppendRelInfo.

OK. So, to keep the old behavior (if at all), we'd actually need a new
argument rootParentRTindex. Old behavior being that all child
PlanRowMarks has the rootParentRTindex as their prti.

It seems though that the new behavior where prti will now be set to the
direct parent's RT index is more or less harmless, because whatever we set
prti to, as long as it's different from rti, we can consider it a child
PlanRowMark. So it might be fine to set prti to direct parent's RT index.

That said, I noticed that we might need to be careful about what the value
of the root parent's PlanRowMark's allMarkType field gets set to. We need
to make sure that it reflects markType of all partitions in the tree,
including those that are not root parent's direct children. Is that true
with the proposed implementation?

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#224Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#222)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/12 16:55, Ashutosh Bapat wrote:

On Tue, Sep 12, 2017 at 1:16 PM, Amit Langote wrote:

So I looked at this a bit closely and came to the conclusion that we may
not need to keep partitioned table RT indexes in the
(Merge)Append.partitioned_rels after all, as far as execution-time locking
is concerned.

Consider two cases:

1. Plan is created and executed in the same transaction

In this case, locks taken on the partitioned tables by the planner will
suffice.

2. Plan is executed in a different transaction from the one in which it
was created (a cached plan)

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Moreover, because all the tables from plannedstmt->rtable, including the
partitioned tables, will be added to PlannedStmt.relationsOids, any
invalidation events affecting the partitioned tables (for example,
add/remove a partition) will cause the plan involving partitioned tables
to be recreated.

In none of this do we rely on the partitioned table RT indexes appearing
in the (Merge)Append node itself. Maybe, we should just remove
partitioned_rels from (Merge)AppendPath and (Merge)Append node in a
separate patch and move on.

Thoughts?

Yes, I did the same analysis (to which I refer in my earlier reply to
you). I too think we should just remove partitioned_rels from Append
paths. But then the question is those are then transferred to
ModifyTable node in create_modifytable_plan() and use it for something
else. What should we do about that code? I don't think we are really
using that list from ModifyTable node as well, so may be we could
remove it from there as well. What do you think? Does that mean
partitioned_rels isn't used at all in the code?

No, we cannot simply get rid of partitioned_rels altogether. We'll need
to keep it in the ModifyTable node, because we *do* need the
nonleafResultRelations list in PlannedStmt to distinguish partitioned
table result relations, which set_plan_refs builds by concatenating
partitioned_rels lists of various ModifyTable nodes of the query. The
PlannedStmt.nonleafResultRelations list actually has some use (which
parallels PlannedStmt.resultRelations), but partitioned_rels list in the
individual (Merge)Append, as it turns out, doesn't.

So, we can remove partitioned_rels from (Merge)AppendPath and
(Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Don't we need partitioned_rels from Append paths to be transferred to
ModifyTable node or we have a different way of calculating
nonleafResultRelations?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#225Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#224)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/12 17:53, Ashutosh Bapat wrote:

On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:

So, we can remove partitioned_rels from (Merge)AppendPath and
(Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Don't we need partitioned_rels from Append paths to be transferred to
ModifyTable node or we have a different way of calculating
nonleafResultRelations?

No, we don't transfer partitioned_rels from Append path to ModifyTable
node. inheritance_planner(), that builds the ModifyTable path for
UPDATE/DELETE on a partitioned table, fetches partitioned_rels from
root->pcinfo_list itself and passes it to create_modifytable_path. No
Append path is involved in that case. PlannedStmt.nonleafResultRelations
is built by concatenating the partitioned_rels lists of all ModifyTable
nodes appearing in the query. It does not depend on Append's or
AppendPath's partitioned_rels.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#226Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#223)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

That said, I noticed that we might need to be careful about what the value
of the root parent's PlanRowMark's allMarkType field gets set to. We need
to make sure that it reflects markType of all partitions in the tree,
including those that are not root parent's direct children. Is that true
with the proposed implementation?

Yes. We include child's allMarkTypes into parent's allMarkTypes. So,
top parent's PlanRowMarks should have all descendant's allMarkTypes,
which is not happening in the patch right now. There are two ways to
fix that.

1. Pass top parent's PlanRowMark all the way down to the leaf
partitions, so that current expand_single_inheritance_child() collects
allMarkTypes of all children correctly. But this way, PlanRowMarks of
intermediate parent does not reflect allMarkTypes of its children,
only top root records that.
2. Pass immediate parent's PlanRowMark to
expand_single_inheritance_child(), so that it records allMarkTypes of
its children. In expand_partitioned_rtentry() have following sequence

expand_single_inheritance_child(root, parentrte, parentRTindex,
parentrel, parentrc, childrel,
appinfos, &childrte, &childRTindex,
&childrc);

/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
expand_partitioned_rtentry(root, childrte, childRTindex,
childrel, childrc, lockmode, appinfos,
partitioned_child_rels);

/* Include child's rowmark type in parent's allMarkTypes */
parentrc->allMarkTypes |= childrc->allMarkTypes;
}
so that we push allMarkTypes up the hierarchy.

I like the second way, since every intermediate parent records
allMarkTypes of its descendants.

Thoughts?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#227Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#226)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/12 18:49, Ashutosh Bapat wrote:

On Tue, Sep 12, 2017 at 2:17 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

That said, I noticed that we might need to be careful about what the value
of the root parent's PlanRowMark's allMarkType field gets set to. We need
to make sure that it reflects markType of all partitions in the tree,
including those that are not root parent's direct children. Is that true
with the proposed implementation?

Yes. We include child's allMarkTypes into parent's allMarkTypes. So,
top parent's PlanRowMarks should have all descendant's allMarkTypes,
which is not happening in the patch right now. There are two ways to
fix that.

1. Pass top parent's PlanRowMark all the way down to the leaf
partitions, so that current expand_single_inheritance_child() collects
allMarkTypes of all children correctly. But this way, PlanRowMarks of
intermediate parent does not reflect allMarkTypes of its children,
only top root records that.
2. Pass immediate parent's PlanRowMark to
expand_single_inheritance_child(), so that it records allMarkTypes of
its children. In expand_partitioned_rtentry() have following sequence

expand_single_inheritance_child(root, parentrte, parentRTindex,
parentrel, parentrc, childrel,
appinfos, &childrte, &childRTindex,
&childrc);

/* If this child is itself partitioned, recurse */
if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
expand_partitioned_rtentry(root, childrte, childRTindex,
childrel, childrc, lockmode, appinfos,
partitioned_child_rels);

/* Include child's rowmark type in parent's allMarkTypes */
parentrc->allMarkTypes |= childrc->allMarkTypes;
}
so that we push allMarkTypes up the hierarchy.

I like the second way, since every intermediate parent records
allMarkTypes of its descendants.

I like the second way, too.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#228Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#225)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 12, 2017 at 2:35 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/12 17:53, Ashutosh Bapat wrote:

On Tue, Sep 12, 2017 at 1:42 PM, Amit Langote wrote:

So, we can remove partitioned_rels from (Merge)AppendPath and
(Merge)Append nodes and remove ExecLockNonLeafAppendTables().

Don't we need partitioned_rels from Append paths to be transferred to
ModifyTable node or we have a different way of calculating
nonleafResultRelations?

No, we don't transfer partitioned_rels from Append path to ModifyTable
node. inheritance_planner(), that builds the ModifyTable path for
UPDATE/DELETE on a partitioned table, fetches partitioned_rels from
root->pcinfo_list itself and passes it to create_modifytable_path. No
Append path is involved in that case. PlannedStmt.nonleafResultRelations
is built by concatenating the partitioned_rels lists of all ModifyTable
nodes appearing in the query. It does not depend on Append's or
AppendPath's partitioned_rels.

Ok. Thanks for the explanation.

This make me examine inheritance_planner() closely and I think I have
spotted a thinko there. In inheritance_planner() parent_rte is set to
the RTE of parent to start with and then in the loop
1132 /*
1133 * And now we can get on with generating a plan for each child table.
1134 */
1135 foreach(lc, root->append_rel_list)
1136 {
... code clipped
1165 /*
1166 * If there are securityQuals attached to the parent,
move them to the
1167 * child rel (they've already been transformed properly for that).
1168 */
1169 parent_rte = rt_fetch(parentRTindex, subroot->parse->rtable);
1170 child_rte = rt_fetch(appinfo->child_relid, subroot->parse->rtable);
1171 child_rte->securityQuals = parent_rte->securityQuals;
1172 parent_rte->securityQuals = NIL;

we set parent_rte to the one obtained from subroot->parse, which
happens to be the same (at least in contents) as original parent_rte.
Later we use this parent_rte to pull partitioned_rels outside that
loop

1371 if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE)
1372 {
1373 partitioned_rels = get_partitioned_child_rels(root, parentRTindex);
1374 /* The root partitioned table is included as a child rel */
1375 Assert(list_length(partitioned_rels) >= 1);
1376 }

I think the code here expects the original parent_rte and not the one
we set around line 1169.

This isn't a bug right now, since both the parent_rte s have same
content. But I am not sure if that will remain to be so. Here's patch
to fix the thinko.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

inh_planner_prte.patchapplication/octet-stream; name=inh_planner_prte.patch
#229Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#220)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Hmm. The problem with this theory in my view is that it doesn't
explain why InitPlan() and ExecOpenScanRelation() lock the relations
instead of just assuming that they are already locked either by
AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables()
doesn't really need to take locks, then ExecOpenScanRelation() must
not need to do it either. We invented ExecLockNonLeafAppendTables()
on the occasion of removing the scans of those tables which would
previously have caused ExecOpenScanRelation() to be invoked, so as to
keep the locking behavior unchanged.

AcquireExecutorLocks() looks like an odd bit of code to me. The
executor itself locks result tables in InitPlan() and then everything
else during InitPlan() and all of the others later on while walking
the plan tree -- comments in InitPlan() say that this is to avoid a
lock upgrade hazard if a result rel is also a source rel. But
AcquireExecutorLocks() has no such provision; it just locks everything
in RTE order. In theory, that's a deadlock hazard of another kind, as
we just talked about in the context of EIBO. In fact, expanding in
bound order has made the situation worse: before, expansion order and
locking order were the same, so maybe having AcquireExecutorLocks()
work in RTE order coincidentally happened to give the same result as
the executor code itself as long as there are no result relations.
But this is certainly not true any more. I'm not sure it's worth
expending a lot of time on this -- it's evidently not a problem in
practice, or somebody probably would've complained before now.

But that having been said, I don't think we should assume that all the
locks taken from the executor are worthless because plancache.c will
always do the job for us. I don't know of a case where we execute a
saved plan without going through the plan cache, but that doesn't mean
that there isn't one or that there couldn't be one in the future.
It's not the job of these partitioning patches to whack around the way
we do locking in general -- they should preserve the existing behavior
as much as possible. If we want to get rid of the locking in the
executor altogether, that's a separate discussion where, I have a
feeling, there will prove to be better reasons for the way things are
than we are right now supposing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#230Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#228)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/12 19:56, Ashutosh Bapat wrote:

I think the code here expects the original parent_rte and not the one
we set around line 1169.

This isn't a bug right now, since both the parent_rte s have same
content. But I am not sure if that will remain to be so. Here's patch
to fix the thinko.

Instead of the new bool is_parent_partitioned, why not move the code to
set partitioned_rels to the block where you're now setting
is_parent_partitioned.

Also, since we know this isn't a bug at the moment but will turn into one
once we have step-wise expansion, why not include this fix in that patch
itself?

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#231Amit Khandekar
Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Ashutosh Bapat (#228)
Re: Partition-wise join for join between (declaratively) partitioned tables

Hi,

Rafia had done some testing on TPCH queries using Partition-wise join
patch along with Parallel Append patch.

There, we had observed that for query 4, even though the partition
wise joins are under a Parallel Append, the join are all non-partial.

Specifically, the partition-wise join has non-partial nested loop
joins when actually it was expected to have partial nested loop joins.
(The difference can be seen by the observation that the outer relation
of that join is scanned by non-parallel Bitmap Heap scan when it
should have used Parallel Bitmap Heap Scan).

Here is the detailed analysis , including where I think is the issue :

/messages/by-id/CAJ3gD9cZms1ND3p=NN=hDYDFt_SeKq1htMBhbj85bOmvJwY5fg@mail.gmail.com

All the TPCH results are posted in the same above mail thread.

Thanks
-Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#232Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#229)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Hmm. The problem with this theory in my view is that it doesn't
explain why InitPlan() and ExecOpenScanRelation() lock the relations
instead of just assuming that they are already locked either by
AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables()
doesn't really need to take locks, then ExecOpenScanRelation() must
not need to do it either. We invented ExecLockNonLeafAppendTables()
on the occasion of removing the scans of those tables which would
previously have caused ExecOpenScanRelation() to be invoked, so as to
keep the locking behavior unchanged.

AcquireExecutorLocks() looks like an odd bit of code to me. The
executor itself locks result tables in InitPlan() and then everything
else during InitPlan() and all of the others later on while walking
the plan tree -- comments in InitPlan() say that this is to avoid a
lock upgrade hazard if a result rel is also a source rel. But
AcquireExecutorLocks() has no such provision; it just locks everything
in RTE order. In theory, that's a deadlock hazard of another kind, as
we just talked about in the context of EIBO. In fact, expanding in
bound order has made the situation worse: before, expansion order and
locking order were the same, so maybe having AcquireExecutorLocks()
work in RTE order coincidentally happened to give the same result as
the executor code itself as long as there are no result relations.
But this is certainly not true any more. I'm not sure it's worth
expending a lot of time on this -- it's evidently not a problem in
practice, or somebody probably would've complained before now.

But that having been said, I don't think we should assume that all the
locks taken from the executor are worthless because plancache.c will
always do the job for us. I don't know of a case where we execute a
saved plan without going through the plan cache, but that doesn't mean
that there isn't one or that there couldn't be one in the future.
It's not the job of these partitioning patches to whack around the way
we do locking in general -- they should preserve the existing behavior
as much as possible. If we want to get rid of the locking in the
executor altogether, that's a separate discussion where, I have a
feeling, there will prove to be better reasons for the way things are
than we are right now supposing.

I agree that it's not the job of these patches to change the locking
or even get rid of partitioned_rels. In order to continue returning
partitioned_rels in Append paths esp. in the case of queries involving
set operations and partitioned table e.g "select 1 from t1 union all
select 2 from t1;" in which t1 is multi-level partitioned table, we
need a fix in add_paths_to_append_rels(). The fix provided in [1]/messages/by-id/d2f1cdcb-ebb4-76c5-e471-79348ca5d7a7@lab.ntt.co.jp is
correct but we will need a longer explanation of why we have to
involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
is complicated. If we get rid of partitioned_rels, we don't need to
fix that code in add_paths_to_append_rel().

I suggested that [2]/messages/by-id/CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
-- (excerpt from [2]/messages/by-id/CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company)

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?
--

Amit Langote agrees with this. It kind of makes the assertion lame but
keeps the code sane. What do you think?

[1]: /messages/by-id/d2f1cdcb-ebb4-76c5-e471-79348ca5d7a7@lab.ntt.co.jp
[2]: /messages/by-id/CAFjFpRfJ3GRRmmOugaMA-q4i=se5P6yjZ_C6A6HDRDQQTGXy1A@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#233Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Langote (#230)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 13, 2017 at 11:29 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/12 19:56, Ashutosh Bapat wrote:

I think the code here expects the original parent_rte and not the one
we set around line 1169.

This isn't a bug right now, since both the parent_rte s have same
content. But I am not sure if that will remain to be so. Here's patch
to fix the thinko.

Instead of the new bool is_parent_partitioned, why not move the code to
set partitioned_rels to the block where you're now setting
is_parent_partitioned.

Also, since we know this isn't a bug at the moment but will turn into one
once we have step-wise expansion, why not include this fix in that patch
itself?

It won't turn into a bug with step-wise expansion since every
parent_rte will have RELKIND_PARTITIONED_TABLE for a partitioned top
parent, which is used to extract the partitioned_rels. But I guess,
it's better to fix the thinko in step-wise expansion since parent_rte
itself changes.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#234Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Amit Khandekar (#231)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Hi,

Rafia had done some testing on TPCH queries using Partition-wise join
patch along with Parallel Append patch.

There, we had observed that for query 4, even though the partition
wise joins are under a Parallel Append, the join are all non-partial.

Specifically, the partition-wise join has non-partial nested loop
joins when actually it was expected to have partial nested loop joins.
(The difference can be seen by the observation that the outer relation
of that join is scanned by non-parallel Bitmap Heap scan when it
should have used Parallel Bitmap Heap Scan).

Here is the detailed analysis , including where I think is the issue :

/messages/by-id/CAJ3gD9cZms1ND3p=NN=hDYDFt_SeKq1htMBhbj85bOmvJwY5fg@mail.gmail.com

All the TPCH results are posted in the same above mail thread.

Can you please check if the attached patch fixes the issue.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

reparameterize_partial_nestloop_inner.patchtext/x-patch; charset=US-ASCII; name=reparameterize_partial_nestloop_inner.patch
#235Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ashutosh Bapat (#232)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/13 16:21, Ashutosh Bapat wrote:

On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

locks taken from the executor are worthless because plancache.c will
always do the job for us. I don't know of a case where we execute a
saved plan without going through the plan cache, but that doesn't mean
that there isn't one or that there couldn't be one in the future.
It's not the job of these partitioning patches to whack around the way
we do locking in general -- they should preserve the existing behavior
as much as possible. If we want to get rid of the locking in the
executor altogether, that's a separate discussion where, I have a
feeling, there will prove to be better reasons for the way things are
than we are right now supposing.

I agree that it's not the job of these patches to change the locking
or even get rid of partitioned_rels. In order to continue returning
partitioned_rels in Append paths esp. in the case of queries involving
set operations and partitioned table e.g "select 1 from t1 union all
select 2 from t1;" in which t1 is multi-level partitioned table, we
need a fix in add_paths_to_append_rels(). The fix provided in [1] is
correct but we will need a longer explanation of why we have to
involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
is complicated. If we get rid of partitioned_rels, we don't need to
fix that code in add_paths_to_append_rel().

Yeah, let's get on with setting partitioned_rels in AppendPath correctly
in this patch. Ashutosh's suggested approach seems fine, although it
needlessly requires to scan root->pcinfo_list. But it shouldn't be longer
than the number of partitioned tables in the query, so maybe that's fine
too. At least, it doesn't require us to add code to
add_paths_to_append_rel() that can be pretty hard to wrap one's head around.

That said, we might someday need to look carefully at some things that
Robert mentioned carefully, especially around the order of locks taken by
AcquireExecutorLocks() in light of the EIBO patch getting committed.

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#236Amit Khandekar
Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Ashutosh Bapat (#234)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 13 September 2017 at 13:05, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Sep 13, 2017 at 12:32 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Hi,

Rafia had done some testing on TPCH queries using Partition-wise join
patch along with Parallel Append patch.

There, we had observed that for query 4, even though the partition
wise joins are under a Parallel Append, the join are all non-partial.

Specifically, the partition-wise join has non-partial nested loop
joins when actually it was expected to have partial nested loop joins.
(The difference can be seen by the observation that the outer relation
of that join is scanned by non-parallel Bitmap Heap scan when it
should have used Parallel Bitmap Heap Scan).

Here is the detailed analysis , including where I think is the issue :

/messages/by-id/CAJ3gD9cZms1ND3p=NN=hDYDFt_SeKq1htMBhbj85bOmvJwY5fg@mail.gmail.com

All the TPCH results are posted in the same above mail thread.

Can you please check if the attached patch fixes the issue.

Thanks Ashutosh. Yes, it does fix the issue. Partial Nested Loop joins
are generated now. If I see any unexpected differences in the
estimated or actual costs, I will report that in the Parallel Append
thread. As far as Partition-wise join is concerned, this issue is
solved, because Partial nested loop join does get created.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#237Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#232)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 13, 2017 at 12:51 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Sep 13, 2017 at 12:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 12, 2017 at 3:46 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

In this case, AcquireExecutorLocks will lock all the relations in
PlannedStmt.rtable, which must include all partitioned tables of all
partition trees involved in the query. Of those, it will lock the tables
whose RT indexes appear in PlannedStmt.nonleafResultRelations with
RowExclusiveLock mode. PlannedStmt.nonleafResultRelations is a global
list of all partitioned table RT indexes obtained by concatenating
partitioned_rels lists of all ModifyTable nodes involved in the query
(set_plan_refs does that). We need to distinguish nonleafResultRelations,
because we need to take the stronger lock on a given table before any
weaker one if it happens to appear in the query as a non-result relation
too, to avoid lock strength upgrade deadlock hazard.

Hmm. The problem with this theory in my view is that it doesn't
explain why InitPlan() and ExecOpenScanRelation() lock the relations
instead of just assuming that they are already locked either by
AcquireExecutorLocks or by planning. If ExecLockNonLeafAppendTables()
doesn't really need to take locks, then ExecOpenScanRelation() must
not need to do it either. We invented ExecLockNonLeafAppendTables()
on the occasion of removing the scans of those tables which would
previously have caused ExecOpenScanRelation() to be invoked, so as to
keep the locking behavior unchanged.

AcquireExecutorLocks() looks like an odd bit of code to me. The
executor itself locks result tables in InitPlan() and then everything
else during InitPlan() and all of the others later on while walking
the plan tree -- comments in InitPlan() say that this is to avoid a
lock upgrade hazard if a result rel is also a source rel. But
AcquireExecutorLocks() has no such provision; it just locks everything
in RTE order. In theory, that's a deadlock hazard of another kind, as
we just talked about in the context of EIBO. In fact, expanding in
bound order has made the situation worse: before, expansion order and
locking order were the same, so maybe having AcquireExecutorLocks()
work in RTE order coincidentally happened to give the same result as
the executor code itself as long as there are no result relations.
But this is certainly not true any more. I'm not sure it's worth
expending a lot of time on this -- it's evidently not a problem in
practice, or somebody probably would've complained before now.

But that having been said, I don't think we should assume that all the
locks taken from the executor are worthless because plancache.c will
always do the job for us. I don't know of a case where we execute a
saved plan without going through the plan cache, but that doesn't mean
that there isn't one or that there couldn't be one in the future.
It's not the job of these partitioning patches to whack around the way
we do locking in general -- they should preserve the existing behavior
as much as possible. If we want to get rid of the locking in the
executor altogether, that's a separate discussion where, I have a
feeling, there will prove to be better reasons for the way things are
than we are right now supposing.

I agree that it's not the job of these patches to change the locking
or even get rid of partitioned_rels. In order to continue returning
partitioned_rels in Append paths esp. in the case of queries involving
set operations and partitioned table e.g "select 1 from t1 union all
select 2 from t1;" in which t1 is multi-level partitioned table, we
need a fix in add_paths_to_append_rels(). The fix provided in [1] is
correct but we will need a longer explanation of why we have to
involve RTE_SUBQUERY with RELKIND_PARTITIONED_TABLE. The explanation
is complicated. If we get rid of partitioned_rels, we don't need to
fix that code in add_paths_to_append_rel().

I suggested that [2]
-- (excerpt from [2])

Actually, the original problem that caused this discussion started
with an assertion failure in get_partitioned_child_rels() as
Assert(list_length(result) >= 1);

This assertion fails if result is NIL when an intermediate partitioned
table is passed. May be we should assert (result == NIL ||
list_length(result) == 1) and allow that function to be called even
for intermediate partitioned partitions for which the function will
return NIL. That will leave the code in add_paths_to_append_rel()
simple. Thoughts?
--

Amit Langote agrees with this. It kind of makes the assertion lame but
keeps the code sane. What do you think?

I debugged what happens in case of query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without multi-level
expansion patch attached). It doesn't set partitioned_rels in Append
path that gets converted into Append plan. Remember t1 is a
multi-level partitioned table here with t1p1 as its immediate
partition and t1p1p1 as partition of t1p1. So, the
set_append_rel_pathlist() recurses once as shown in the following
stack trace.

#0 add_paths_to_append_rel (root=0x23e4308, rel=0x23fb768,
live_childrels=0x23ff5f0) at allpaths.c:1281
#1 0x000000000076e170 in set_append_rel_pathlist (root=0x23e4308,
rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:1262
#2 0x000000000076cf23 in set_rel_pathlist (root=0x23e4308,
rel=0x23fb768, rti=4, rte=0x23f3268) at allpaths.c:431
#3 0x000000000076e0f6 in set_append_rel_pathlist (root=0x23e4308,
rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:1247
#4 0x000000000076cf23 in set_rel_pathlist (root=0x23e4308,
rel=0x23fb478, rti=1, rte=0x2382070) at allpaths.c:431
#5 0x000000000076cc22 in set_base_rel_pathlists (root=0x23e4308) at
allpaths.c:309

When add_paths_to_append_rel() (frame 0) is called for t1, it gets
partitioned_rels and stuffs it in append path/s it creates. But those
paths are flattened into the append paths created for the set
operations when add_paths_to_append_rels() is called from frame 3.
While flattening the append paths in accumulate_append_subpath() we do
not pull any partitioned_rels that are stuffed in those paths and thus
the final append path/s created does not have partitioned_rels in
there.

The same behaviour is retained by my v30 patchset [1]/messages/by-id/CAFjFpRfHkJW3G=_PnSUc6PbXJE48AWYwyRzaGqtfKzzoU4wXXw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company. I think we
should go ahead by fixing add_paths_to_append_rel() as done in that
patchset. partitioned_rels needs to be removed from append paths
anyway, so that code will be removed when we do that.

[1]: /messages/by-id/CAFjFpRfHkJW3G=_PnSUc6PbXJE48AWYwyRzaGqtfKzzoU4wXXw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#238Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#237)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I debugged what happens in case of query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without multi-level
expansion patch attached). It doesn't set partitioned_rels in Append
path that gets converted into Append plan. Remember t1 is a
multi-level partitioned table here with t1p1 as its immediate
partition and t1p1p1 as partition of t1p1. So, the
set_append_rel_pathlist() recurses once as shown in the following
stack trace.

Nice debugging. I spent some time today looking at this and I think
it's a bug in v10, and specifically in add_paths_to_append_rel(),
which only sets partitioned_rels correctly when the appendrel is a
partitioned rel, and not when it's a subquery RTE with one or more
partitioned queries beneath it.

Attached are two patches either one of which will fix it. First, I
wrote mechanical-partrels-fix.patch, which just mechanically
propagates partitioned_rels lists from accumulated subpaths into the
list used to construct the parent (Merge)AppendPath. I wasn't entire
happy with that, because it ends up building multiple partitioned_rels
lists for the same RelOptInfo. That seems silly, but there's no
principled way to avoid it; avoiding it amounts to hoping that all the
paths for the same relation carry the same partitioned_rels list,
which is uncomfortable.

So then I wrote pcinfo-for-subquery.patch. That patch notices when an
RTE_SUBQUERY appendrel is processed and accumulates the
partitioned_rels of its immediate children; in case there can be
multiple nested levels of subqueries before we get down to the actual
partitioned rel, it also adds a PartitionedChildRelInfo for the
subquery RTE, so that there's no need to walk the whole tree to build
the partitioned_rels list at higher levels, just the immediate
children. I find this fix a lot more satisfying. It adds less code
and does no extra work in the common case.

Notice that the choice of fix we adopt has consequences for your
0001-Multi-level-partitioned-table-expansion.patch -- with
mechanical-partrels-fix.patch, that patch could either associated all
partitioned_rels with the top-parent or it could work level by level
and everything would get properly assembled later. But with
pcinfo-for-subquery.patch, we need everything associated with the
top-parent. That doesn't seem like a problem to me, but it's
something to note.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

mechanical-partrels-fix.patchapplication/octet-stream; name=mechanical-partrels-fix.patch
pcinfo-for-subquery.patchapplication/octet-stream; name=pcinfo-for-subquery.patch
#239Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#238)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/14 7:43, Robert Haas wrote:

On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I debugged what happens in case of query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without multi-level
expansion patch attached). It doesn't set partitioned_rels in Append
path that gets converted into Append plan. Remember t1 is a
multi-level partitioned table here with t1p1 as its immediate
partition and t1p1p1 as partition of t1p1. So, the
set_append_rel_pathlist() recurses once as shown in the following
stack trace.

Nice debugging.

+1.

I spent some time today looking at this and I think
it's a bug in v10, and specifically in add_paths_to_append_rel(),
which only sets partitioned_rels correctly when the appendrel is a
partitioned rel, and not when it's a subquery RTE with one or more
partitioned queries beneath it.

Attached are two patches either one of which will fix it. First, I
wrote mechanical-partrels-fix.patch, which just mechanically
propagates partitioned_rels lists from accumulated subpaths into the
list used to construct the parent (Merge)AppendPath. I wasn't entire
happy with that, because it ends up building multiple partitioned_rels
lists for the same RelOptInfo. That seems silly, but there's no
principled way to avoid it; avoiding it amounts to hoping that all the
paths for the same relation carry the same partitioned_rels list,
which is uncomfortable.

So then I wrote pcinfo-for-subquery.patch. That patch notices when an
RTE_SUBQUERY appendrel is processed and accumulates the
partitioned_rels of its immediate children; in case there can be
multiple nested levels of subqueries before we get down to the actual
partitioned rel, it also adds a PartitionedChildRelInfo for the
subquery RTE, so that there's no need to walk the whole tree to build
the partitioned_rels list at higher levels, just the immediate
children. I find this fix a lot more satisfying. It adds less code
and does no extra work in the common case.

I very much like pcinfo-for-subquery.patch, although I'm not sure if we
need to create PartitionedChildRelInfo for the sub-query parent RTE as the
patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL
subqueries are flattened way before we get to add_paths_to_append_rel();
if it could not be flattened, there wouldn't be a call to
add_paths_to_append_rel() in the first place, because no AppendRelInfos
would be generated. See what happens when is_simple_union_all_recurse()
returns false to flatten_simple_union_all() -- no AppendRelInfos will be
generated and added to root->append_rel_list in that case.

IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
like we're setting out to build for multi-level partitioned tables.

So, as things stand today, there can at most be one recursive call of
add_path_to_append_rel() for a sub-query parent RTE, that is, if its child
sub-queries contain partitioned tables, but not more. The other patch
(multi-level expansion of partitioned tables) will change that, but even
then we won't need sub-query's own PartitioendChildRelInfo.

Notice that the choice of fix we adopt has consequences for your
0001-Multi-level-partitioned-table-expansion.patch -- with
mechanical-partrels-fix.patch, that patch could either associated all
partitioned_rels with the top-parent or it could work level by level
and everything would get properly assembled later. But with
pcinfo-for-subquery.patch, we need everything associated with the
top-parent. That doesn't seem like a problem to me, but it's
something to note.

I think it's fine.

With 0001-Multi-level-partitioned-table-expansion.patch,
get_partitioned_child_rels() will get called even for non-root partitioned
tables, for which it won't find a valid pcinfo. I think that patch must
also change its callers to stop Asserting that a valid pcinfo is returned.

Spotted a typo in pcinfo-for-subquery.patch:

+ * A plain relation will alread have

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#240Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#238)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 14, 2017 at 4:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I debugged what happens in case of query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without multi-level
expansion patch attached). It doesn't set partitioned_rels in Append
path that gets converted into Append plan. Remember t1 is a
multi-level partitioned table here with t1p1 as its immediate
partition and t1p1p1 as partition of t1p1. So, the
set_append_rel_pathlist() recurses once as shown in the following
stack trace.

Nice debugging. I spent some time today looking at this and I think
it's a bug in v10, and specifically in add_paths_to_append_rel(),
which only sets partitioned_rels correctly when the appendrel is a
partitioned rel, and not when it's a subquery RTE with one or more
partitioned queries beneath it.

Attached are two patches either one of which will fix it. First, I
wrote mechanical-partrels-fix.patch, which just mechanically
propagates partitioned_rels lists from accumulated subpaths into the
list used to construct the parent (Merge)AppendPath. I wasn't entire
happy with that, because it ends up building multiple partitioned_rels
lists for the same RelOptInfo. That seems silly, but there's no
principled way to avoid it; avoiding it amounts to hoping that all the
paths for the same relation carry the same partitioned_rels list,
which is uncomfortable.

So then I wrote pcinfo-for-subquery.patch. That patch notices when an
RTE_SUBQUERY appendrel is processed and accumulates the
partitioned_rels of its immediate children; in case there can be
multiple nested levels of subqueries before we get down to the actual
partitioned rel, it also adds a PartitionedChildRelInfo for the
subquery RTE, so that there's no need to walk the whole tree to build
the partitioned_rels list at higher levels, just the immediate
children. I find this fix a lot more satisfying. It adds less code
and does no extra work in the common case.

Thanks a lot for the patch. I have included pcinfo-for-subquery.patch
in my patchset as the first patch with typo corrections suggested by
Amit Langote.

Notice that the choice of fix we adopt has consequences for your
0001-Multi-level-partitioned-table-expansion.patch -- with
mechanical-partrels-fix.patch, that patch could either associated all
partitioned_rels with the top-parent or it could work level by level
and everything would get properly assembled later. But with
pcinfo-for-subquery.patch, we need everything associated with the
top-parent. That doesn't seem like a problem to me, but it's
something to note.

I have few changes to multi-level expansion patch as per discussion in
earlier mails
1. expand_single_inheritance_child() gets the top parent's PlanRowMark
from which it builds the child's PlanRowMark and also update
allMarkTypes of the top parent's PlanRowMark. The chlid's PlanRowMark
contains the RTI of the top parent, which is pulled from the top
parent's PlanRowMark. This is to keep the old behaviour intact.

2. Updated expand_single_inheritance_child's prologue to explain
various output arguments, per suggestion from Amit Langote. Also
included comments about the way we construct child PlanRowMark. Please
see if the comments look good.

3. As suggested by Amit Langote, with multi-level partitioned table
expansion, intermediate partitioned tables won't have pcinfo
associated them. So, that patch removes the assertion
Assert(list_length(partitioned_rels) >= 1) in
add_paths_to_append_rels(). I didn't remove that assertion from your
patch so that you could cherry-pick that commit to v10 where that
assertion holds true.

4. Fixed inheritance_planner() to use top parent's RTE to pull
partitioned_rels per discussion with Amit few mails back [1]/messages/by-id/CAFjFpRe62H0rTb4Rb7wOVSR25xfNW+mt1Ncp-OtzGaEtZBTLwA@mail.gmail.com.

Please let me know if I have missed anything; it's been some long discussion.

Apart from this I have included fix to reparameterize parallel nested
loop paths as per discussion in [2]/messages/by-id/CAJ3gD9ctVgv6r0-7B6js7Z5uPHXx+KA5jK-3=uFsGwKOXfTddg@mail.gmail.com.

Please note that I have removed the advanced partitioning patches from
the attached patchset since those need a rebase because of default
partition support.

[1]: /messages/by-id/CAFjFpRe62H0rTb4Rb7wOVSR25xfNW+mt1Ncp-OtzGaEtZBTLwA@mail.gmail.com
[2]: /messages/by-id/CAJ3gD9ctVgv6r0-7B6js7Z5uPHXx+KA5jK-3=uFsGwKOXfTddg@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v31.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v31.tar.gz
#241Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Amit Langote (#239)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 13, 2017 at 10:57 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

I very much like pcinfo-for-subquery.patch, although I'm not sure if we
need to create PartitionedChildRelInfo for the sub-query parent RTE as the
patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL
subqueries are flattened way before we get to add_paths_to_append_rel();
if it could not be flattened, there wouldn't be a call to
add_paths_to_append_rel() in the first place, because no AppendRelInfos
would be generated. See what happens when is_simple_union_all_recurse()
returns false to flatten_simple_union_all() -- no AppendRelInfos will be
generated and added to root->append_rel_list in that case.

IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
like we're setting out to build for multi-level partitioned tables.

So, as things stand today, there can at most be one recursive call of
add_path_to_append_rel() for a sub-query parent RTE, that is, if its child
sub-queries contain partitioned tables, but not more. The other patch
(multi-level expansion of partitioned tables) will change that, but even
then we won't need sub-query's own PartitioendChildRelInfo.

OK, let's assume you're correct unless some contrary evidence emerges.
Committed without that part; thanks for the review.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#242Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#240)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have few changes to multi-level expansion patch as per discussion in
earlier mails

OK, I have committed
0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
changes.

Phew, getting that sorted out has been an astonishing amount of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#243Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#242)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/09/15 4:43, Robert Haas wrote:

On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have few changes to multi-level expansion patch as per discussion in
earlier mails

OK, I have committed
0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
changes.

Phew, getting that sorted out has been an astonishing amount of work.

Yeah, thanks to both of you. Now on to other complicated stuff. :)

Regards,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#244Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Amit Langote (#239)
3 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 14, 2017 at 8:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2017/09/14 7:43, Robert Haas wrote:

On Wed, Sep 13, 2017 at 12:56 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I debugged what happens in case of query "select 1 from t1 union all
select 2 from t1;" with the current HEAD (without multi-level
expansion patch attached). It doesn't set partitioned_rels in Append
path that gets converted into Append plan. Remember t1 is a
multi-level partitioned table here with t1p1 as its immediate
partition and t1p1p1 as partition of t1p1. So, the
set_append_rel_pathlist() recurses once as shown in the following
stack trace.

Nice debugging.

+1.

I spent some time today looking at this and I think
it's a bug in v10, and specifically in add_paths_to_append_rel(),
which only sets partitioned_rels correctly when the appendrel is a
partitioned rel, and not when it's a subquery RTE with one or more
partitioned queries beneath it.

Attached are two patches either one of which will fix it. First, I
wrote mechanical-partrels-fix.patch, which just mechanically
propagates partitioned_rels lists from accumulated subpaths into the
list used to construct the parent (Merge)AppendPath. I wasn't entire
happy with that, because it ends up building multiple partitioned_rels
lists for the same RelOptInfo. That seems silly, but there's no
principled way to avoid it; avoiding it amounts to hoping that all the
paths for the same relation carry the same partitioned_rels list,
which is uncomfortable.

So then I wrote pcinfo-for-subquery.patch. That patch notices when an
RTE_SUBQUERY appendrel is processed and accumulates the
partitioned_rels of its immediate children; in case there can be
multiple nested levels of subqueries before we get down to the actual
partitioned rel, it also adds a PartitionedChildRelInfo for the
subquery RTE, so that there's no need to walk the whole tree to build
the partitioned_rels list at higher levels, just the immediate
children. I find this fix a lot more satisfying. It adds less code
and does no extra work in the common case.

I very much like pcinfo-for-subquery.patch, although I'm not sure if we
need to create PartitionedChildRelInfo for the sub-query parent RTE as the
patch teaches add_paths_to_append_rel() to do. ISTM, nested UNION ALL
subqueries are flattened way before we get to add_paths_to_append_rel();
if it could not be flattened, there wouldn't be a call to
add_paths_to_append_rel() in the first place, because no AppendRelInfos
would be generated. See what happens when is_simple_union_all_recurse()
returns false to flatten_simple_union_all() -- no AppendRelInfos will be
generated and added to root->append_rel_list in that case.

IOW, there won't be nested AppendRelInfos for nested UNION ALL sub-queries
like we're setting out to build for multi-level partitioned tables.

So, as things stand today, there can at most be one recursive call of
add_path_to_append_rel() for a sub-query parent RTE, that is, if its child
sub-queries contain partitioned tables, but not more. The other patch
(multi-level expansion of partitioned tables) will change that, but even
then we won't need sub-query's own PartitioendChildRelInfo.

Notice that the choice of fix we adopt has consequences for your
0001-Multi-level-partitioned-table-expansion.patch -- with
mechanical-partrels-fix.patch, that patch could either associated all
partitioned_rels with the top-parent or it could work level by level
and everything would get properly assembled later. But with
pcinfo-for-subquery.patch, we need everything associated with the
top-parent. That doesn't seem like a problem to me, but it's
something to note.

I think it's fine.

With 0001-Multi-level-partitioned-table-expansion.patch,
get_partitioned_child_rels() will get called even for non-root partitioned
tables, for which it won't find a valid pcinfo. I think that patch must
also change its callers to stop Asserting that a valid pcinfo is returned.

Spotted a typo in pcinfo-for-subquery.patch:

+ * A plain relation will alread have

Thanks,
Amit

On TPC-H benchmarking of this patch, I found a regression in Q7. It
was taking some 1500s with the patch and some 900s without the patch.
Please find the attached pwd_reg.zip for the output of explain analyse
on head and with patch.

The experimental settings used were,
commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
patch-version = v26

Server settings:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
max_parallel_workers_per_gather = 4

Partitioning information:
Partitioning scheme = by range
Number of partitions in lineitem and orders table = 106
partition key for lineitem = l_orderkey
partition key for orders = o_orderkey

Apart from these there is a regression case on a custom table, on head
query completes in 20s and with this patch it takes 27s. Please find
the attached .out and .sql file for the output and schema for the test
case respectively. I have reported this case before (sometime around
March this year) as well, but I am not sure if it was overlooked or is
an unimportant and expected behaviour for some reason.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

pwj_reg.outapplication/octet-stream; name=pwj_reg.out
test_case_pwj.sqlapplication/octet-stream; name=test_case_pwj.sql
pwj_reg.zipapplication/zip; name=pwj_reg.zip
#245Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#242)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 15, 2017 at 1:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 14, 2017 at 8:06 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I have few changes to multi-level expansion patch as per discussion in
earlier mails

OK, I have committed
0002-Multi-level-partitioned-table-expansion.patch with a few cosmetic
changes.

Phew, getting that sorted out has been an astonishing amount of work.

Thanks a lot Robert.

Here are rebased patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v32.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v32.tar.gz
#246Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#244)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On TPC-H benchmarking of this patch, I found a regression in Q7. It
was taking some 1500s with the patch and some 900s without the patch.
Please find the attached pwd_reg.zip for the output of explain analyse
on head and with patch.

The experimental settings used were,
commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
patch-version = v26

Server settings:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
max_parallel_workers_per_gather = 4

Partitioning information:
Partitioning scheme = by range
Number of partitions in lineitem and orders table = 106
partition key for lineitem = l_orderkey
partition key for orders = o_orderkey

I observe that with partition-wise join patch the planner is using
GatherMerge along-with partition-wise join and on head its not using
GatherMerge. Just to make sure that its partition-wise join which is
causing regression and not GatherMerge, can you please run the query
with enable_gathermerge = false?

I see following lines explain analyze output 7_1.out without the patch
-> Sort (cost=84634030.40..84638520.55 rows=1796063
width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1)
Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
Sort Method: quicksort Memory: 308912kB
-> Hash Join (cost=16080591.94..84447451.72
rows=1796063 width=72) (actual time=252745.701..1057447.219
rows=1749956 loops=1)
Since Sort doesn't filter any rows, we would expect it to output the
same number of rows as hash join underneath it. But the number of rows
differ in this case. I am wondering whether there's some problem with
the explain analyze output itself.

Apart from these there is a regression case on a custom table, on head
query completes in 20s and with this patch it takes 27s. Please find
the attached .out and .sql file for the output and schema for the test
case respectively. I have reported this case before (sometime around
March this year) as well, but I am not sure if it was overlooked or is
an unimportant and expected behaviour for some reason.

Are you talking about [1]/messages/by-id/CAOGQiiMwcjNrunJ_fCDBscrTLeJ-CLp7exfzzipe2ut71n4LUA@mail.gmail.com? I have explained about the regression in
[2]: /messages/by-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eYfb7ieG5d0D1PbsQ@mail.gmail.com

[1]: /messages/by-id/CAOGQiiMwcjNrunJ_fCDBscrTLeJ-CLp7exfzzipe2ut71n4LUA@mail.gmail.com
[2]: /messages/by-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eYfb7ieG5d0D1PbsQ@mail.gmail.com
[3]: /messages/by-id/CAFjFpReJKSdCfaeuZjGD79hOETzpz5BKDxLJgxr7qznrXX+TRw@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#247Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#245)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Thanks a lot Robert.

Here are rebased patches.

I didn't get quite as much time to look at these today as I would have
liked, but here's what I've got so far.

Comments on 0001:

- In the RelOptInfo, part_oids is defined in a completely different
part of the structure than nparts, but you can't use it without nparts
because you don't know how long it is. I suggest moving the
definition to just after nparts.

- On the other hand, maybe we should just remove it completely. I
don't see it in any of the subsequent patches. If it's used by the
advanced matching code, let's leave it out of 0001 for now and add it
back after the basic feature is committed.

- Similarly, partsupfunc isn't used by the later patches either. It
seems it could also be removed, at least for now.

- The comment for partexprs isn't very clear about how the lists
inside the array work. My understanding is that the lists have as
many members as the partition key has columns/expressions.

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design. If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property. You could consider something like
"Relids nullable_rels".

Comments on 0002:

- The relationship between deciding to set partition scheme and
related details and the configured value of enable_partition_wise_join
needs some consideration. If the only use of the partition scheme is
partition-wise join, there's no point in setting it even for a baserel
unless enable_partition_wise_join is set -- but if there are other
important uses for that data, such as Amit's partition pruning work,
then we might want to always set it. And similarly for a join: if the
details are only needed in the partition-wise join case, then we only
need to set them in that case, but if there are other uses, then it's
different. If it turns out that setting these details for a baserel
is useful in other cases but that it's only for a joinrel in the
partition-wise join case, then the patch gets it exactly right. But
is that correct? I'm not sure.

- The naming of enable_partition_wise_join might also need some
thought. What happens when we also have partition-wise aggregate?
What about the proposal to strength-reduce MergeAppend to Append --
would that use this infrastructure? I wonder if we out to call this
enable_partition_wise or enable_partition_wise_planning to make it a
bit more general. Then, too, I've never really liked having
partition_wise in the GUC name because it might make someone think
that it makes you partitions have a lot of wisdom. Removing the
underscore might help: partitionwise. Or maybe there is some whole
different name that would be better. If anyone wants to bikeshed,
now's the time.

- It seems to me that build_joinrel_partition_info() could be
simplified a bit. One thing is that list_copy() is perfectly capable
of handling a NIL input, so there's no need to test for that before
calling it.

Comments on 0003:

- Instead of reorganizing add_paths_to_append_rel as you did, could
you just add an RTE_JOIN case to the switch? Not sure if there's some
problem with that idea, but it seems like it might come out nicer.

On the overall patch set:

- I am curious to know how this has been tested. How much of the new
code is covered by the tests in 0007-Partition-wise-join-tests.patch?
How much does coverage improve with
0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
code, if any, is not covered by either of those test suites? Could we
do meaningful testing of this with something like Andreas
Seltenreich's sqlsmith?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#248Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Robert Haas (#247)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On the overall patch set:

- I am curious to know how this has been tested. How much of the new
code is covered by the tests in 0007-Partition-wise-join-tests.patch?
How much does coverage improve with
0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
code, if any, is not covered by either of those test suites? Could we
do meaningful testing of this with something like Andreas
Seltenreich's sqlsmith?

FWIW I'm working on an answer to both of those question, but keep
getting distracted by other things catching on fire...

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#249Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#248)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Sep 16, 2017 at 9:38 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On the overall patch set:

- I am curious to know how this has been tested. How much of the new
code is covered by the tests in 0007-Partition-wise-join-tests.patch?
How much does coverage improve with
0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
code, if any, is not covered by either of those test suites? Could we
do meaningful testing of this with something like Andreas
Seltenreich's sqlsmith?

FWIW I'm working on an answer to both of those question, but keep
getting distracted by other things catching on fire...

I cobbled together some scripts to figure out the test coverage of
lines actually modified by this patch set. Please see attached.

I'm not sure if there is an established or better way to do this, but
I used git-blame to figure out which lines of gcov output can be
blamed on Ashutosh and prepended that to the lines of gcov's output.
That allowed me to find new/changed code not covered by "make check".
I found 94 untested new lines with 0007 applied and 88 untested new
lines with 0008 applied. The 6 lines that 0008 reaches and 0007
doesn't are:

======== src/backend/optimizer/path/allpaths.c ========
-[TOUCHED BY PATCH SET] #####: 3303: mark_dummy_rel(rel);
-[TOUCHED BY PATCH SET] #####: 3304: return;
-[TOUCHED BY PATCH SET] #####: 1515: continue;
-[TOUCHED BY PATCH SET] #####: 1526: continue;
======== src/backend/optimizer/util/pathnode.c ========
-[TOUCHED BY PATCH SET] #####: 3433: break;
-[TOUCHED BY PATCH SET] #####: 3435: return NULL;

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

patchset-coverage-0007.txt.gzapplication/x-gzip; name=patchset-coverage-0007.txt.gz
#250Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Ashutosh Bapat (#246)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On TPC-H benchmarking of this patch, I found a regression in Q7. It
was taking some 1500s with the patch and some 900s without the patch.
Please find the attached pwd_reg.zip for the output of explain analyse
on head and with patch.

The experimental settings used were,
commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
patch-version = v26

Server settings:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
max_parallel_workers_per_gather = 4

Partitioning information:
Partitioning scheme = by range
Number of partitions in lineitem and orders table = 106
partition key for lineitem = l_orderkey
partition key for orders = o_orderkey

I observe that with partition-wise join patch the planner is using
GatherMerge along-with partition-wise join and on head its not using
GatherMerge. Just to make sure that its partition-wise join which is
causing regression and not GatherMerge, can you please run the query
with enable_gathermerge = false?

That does not sound plausible since around 130s are already spent till
append node. Anyhow, I executed the query with enable_gathermerge =
false, and still it is taking some 1500 secs. Please find the attached
file for the explain analyse output.

I see following lines explain analyze output 7_1.out without the patch
-> Sort (cost=84634030.40..84638520.55 rows=1796063
width=72) (actual time=1061001.435..1061106.608 rows=437209 loops=1)
Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
Sort Method: quicksort Memory: 308912kB
-> Hash Join (cost=16080591.94..84447451.72
rows=1796063 width=72) (actual time=252745.701..1057447.219
rows=1749956 loops=1)
Since Sort doesn't filter any rows, we would expect it to output the
same number of rows as hash join underneath it. But the number of rows
differ in this case. I am wondering whether there's some problem with
the explain analyze output itself.

Limit (cost=83341943.28..83341943.35 rows=1 width=92) (actual
time=1556989.996..1556989.997 rows=1 loops=1)
-> Finalize GroupAggregate (cost=83341943.28..83342723.24
rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1
loops=1)
Group Key: n1.n_name, n2.n_name, (date_part('year'::text,
(lineitem_001.l_shipdate)::timestamp without time zone))
-> Sort (cost=83341943.28..83342043.92 rows=40256 width=92)
(actual time=1556989.910..1556989.911 rows=6 loops=1)
Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
Sort Method: quicksort Memory: 27kB
-> Gather (cost=83326804.81..83338864.31 rows=40256
width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1)
Workers Planned: 4
Workers Launched: 4

AFAICU the node above sort is group-aggregate and then there is limit,
and the number of rows for sort node in explain analyse is returned
number of rows. So, what is happening here is once one group is
completed it is aggregated and fetched by limit, now there is no need
for sort to return any more rows and hence the result.

Apart from these there is a regression case on a custom table, on head
query completes in 20s and with this patch it takes 27s. Please find
the attached .out and .sql file for the output and schema for the test
case respectively. I have reported this case before (sometime around
March this year) as well, but I am not sure if it was overlooked or is
an unimportant and expected behaviour for some reason.

Are you talking about [1]? I have explained about the regression in
[2] and [3]. This looks like an issue with the existing costing model.

[1] /messages/by-id/CAOGQiiMwcjNrunJ_fCDBscrTLeJ-CLp7exfzzipe2ut71n4LUA@mail.gmail.com
[2] /messages/by-id/CAFjFpRedUZPa7tKbCLEGK3u5UWdDNQoN=eYfb7ieG5d0D1PbsQ@mail.gmail.com
[3] /messages/by-id/CAFjFpReJKSdCfaeuZjGD79hOETzpz5BKDxLJgxr7qznrXX+TRw@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

7_gm_false.outapplication/octet-stream; name=7_gm_false.out
#251Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#247)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Thanks a lot Robert.

Here are rebased patches.

I didn't get quite as much time to look at these today as I would have
liked, but here's what I've got so far.

Comments on 0001:

- In the RelOptInfo, part_oids is defined in a completely different
part of the structure than nparts, but you can't use it without nparts
because you don't know how long it is. I suggest moving the
definition to just after nparts.

- On the other hand, maybe we should just remove it completely. I
don't see it in any of the subsequent patches. If it's used by the
advanced matching code, let's leave it out of 0001 for now and add it
back after the basic feature is committed.

No, it's not used by advanced partition matching code. It was used by
to match OIDs with the child rels to order those in the array. But now
that we are expanding in EIBO fashion, it is not useful. Should have
removed it earlier. Removed now.

- Similarly, partsupfunc isn't used by the later patches either. It
seems it could also be removed, at least for now.

It's used by advanced partition matching code to compare bounds. It
will be required by partition pruning patch. But removed for now.

- The comment for partexprs isn't very clear about how the lists
inside the array work. My understanding is that the lists have as
many members as the partition key has columns/expressions.

Actually we are doing some preparation for partition-wise join here.
partexprs and nullable_partexprs are used in partition-wise join
implementation patch. I have updated prologue of RelOptInfo structure
with the comments like below

* Note: A base relation will always have only one set of partition keys. But a
* join relation has as many sets of partition keys as the number of joining
* relations. The number of partition keys is given by
* "part_scheme->partnatts". "partexprs" and "nullable_partexprs" are arrays
* containing part_scheme->partnatts elements. Each element of the array
* contains a list of partition key expressions. For a base relation each list
* contains only one expression. For a join relation each list contains at
* most as many expressions as the joining relations. The expressions in a list
* at a given position in the array correspond to the partition key at that
* position. "partexprs" contains partition keys of non-nullable joining
* relations and "nullable_partexprs" contains partition keys of nullable
* joining relations. For a base relation only "partexprs" is populated.

Let me know this looks fine. The logic to match the partition keys of
joining relations in have_partkey_equi_join() and
match_expr_to_partition_keys() becomes simpler if we arrange the
partition key expressions as array indexed by position of partition
key and each array element as list of partition key expressions at
that position.

partition pruning might need partexprs look up relevant quals, but
nullable_partexprs doesn't have any use there. So may be we should add
nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design. If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property. You could consider something like
"Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

Comments on 0002:

- The relationship between deciding to set partition scheme and
related details and the configured value of enable_partition_wise_join
needs some consideration. If the only use of the partition scheme is
partition-wise join, there's no point in setting it even for a baserel
unless enable_partition_wise_join is set -- but if there are other
important uses for that data, such as Amit's partition pruning work,
then we might want to always set it. And similarly for a join: if the
details are only needed in the partition-wise join case, then we only
need to set them in that case, but if there are other uses, then it's
different. If it turns out that setting these details for a baserel
is useful in other cases but that it's only for a joinrel in the
partition-wise join case, then the patch gets it exactly right. But
is that correct? I'm not sure.

Partition scheme contains the information about data types of
partition keys, which is required to compare partition bounds.
Partition pruning will need to compare constants with partition bounds
and hence will need information contained in partition scheme. So, we
will need to set it for base relations whether or not partition-wise
join is enabled.

- The naming of enable_partition_wise_join might also need some
thought. What happens when we also have partition-wise aggregate?
What about the proposal to strength-reduce MergeAppend to Append --
would that use this infrastructure? I wonder if we out to call this
enable_partition_wise or enable_partition_wise_planning to make it a
bit more general. Then, too, I've never really liked having
partition_wise in the GUC name because it might make someone think
that it makes you partitions have a lot of wisdom. Removing the
underscore might help: partitionwise. Or maybe there is some whole
different name that would be better. If anyone wants to bikeshed,
now's the time.

partitions having a lot of wisdom would be wise_partitions rather than
partition_wise ;).

If partition-wise join is disabled, partition-wise aggregates,
strength reduction of MergeAppend won't be possible on a join tree,
but those will be possible on a base relation. Even if partition-wise
join enabled, one may want to disable other partition-wise
optimizations individually. So, they are somewhat independent
switches. I don't think we should bundle all of those into one.
Whatever names we choose for those GUCs, I think they should have same
naming convention e.g. "partition_wise_xyz". I am open to suggestions
about the names.

- It seems to me that build_joinrel_partition_info() could be
simplified a bit. One thing is that list_copy() is perfectly capable
of handling a NIL input, so there's no need to test for that before
calling it.

partexprs may be NULL for FULL JOIN and nullable_partexprs may be NULL
when there is no nullable relation. So, we have to check existence of
those arrays before accessing lists containing partition key
expressions. list_copy() is being called on individual array elements
and "if" conditions check for the existence of array.

The functions might have become complicated because I am using
outer/inner_partexprs to hold one of the lists and partexprs contains
the array of lists. We may use better named, but I don't have any
better ideas right now. Will think about them.

We could simplify that function according to your suggestion of
nullable_relids. Basically partexprs then contains partition key
expressions all relations nullable and non-nullable. nullable_relids +
pull_varnos() tells us which of those fall on nullable side and which
ones don't. Is this how you are thinking of simplifying it? If we go
with this scheme, again nullable_relids will not be useful for
partition pruning, so may be we should add it as part of 0002
(partition-wise join implementation) instead of 0001.

Comments on 0003:

- Instead of reorganizing add_paths_to_append_rel as you did, could
you just add an RTE_JOIN case to the switch? Not sure if there's some
problem with that idea, but it seems like it might come out nicer.

RTE_JOIN is created only for joins specified using JOIN clause i.e
syntactic joins. The joins created during query planner like rel1,
rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#252Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#251)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

partition pruning might need partexprs look up relevant quals, but
nullable_partexprs doesn't have any use there. So may be we should add
nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

+1.

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design. If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property. You could consider something like
"Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

Good point. Also, I'm not sure about cases like this:

SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;

Suppose the relations are all partitioned by (x, y) but that the =
operator is not strict. A partition-wise join is valid between a and
b, but we can't regard w as partitioned any more, because w.x might
contain nulls in partitions where the partitioning scheme wouldn't
allow them. On the other hand, if the subquery were to select a.x,
a.y then clearly it would be fine: there would be no possibility of a
NULL having been substituted for a proper value.

What if the subquery selected a.x, b.y? Initially, I thought that
would be OK too, because of the fact that the a.y = b.y clause is in
the WHERE clause rather than the join condition. But on further
thought I think that probably doesn't work, because with = being a
non-strict operator there's no guarantee that it would remove any
nulls introduced by the left join. Of course, if the subselect had a
WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
list mention those columns would be fine.

- The naming of enable_partition_wise_join might also need some
thought. What happens when we also have partition-wise aggregate?
What about the proposal to strength-reduce MergeAppend to Append --
would that use this infrastructure? I wonder if we out to call this
enable_partition_wise or enable_partition_wise_planning to make it a
bit more general. Then, too, I've never really liked having
partition_wise in the GUC name because it might make someone think
that it makes you partitions have a lot of wisdom. Removing the
underscore might help: partitionwise. Or maybe there is some whole
different name that would be better. If anyone wants to bikeshed,
now's the time.

partitions having a lot of wisdom would be wise_partitions rather than
partition_wise ;).

Well, maybe it's the joins that have a lot of wisdom, then.
enable_partition_wise_join could be read to mean that we should allow
partitioning of joins, but only if those joins know the secret of true
happiness.

If partition-wise join is disabled, partition-wise aggregates,
strength reduction of MergeAppend won't be possible on a join tree,
but those will be possible on a base relation. Even if partition-wise
join enabled, one may want to disable other partition-wise
optimizations individually. So, they are somewhat independent
switches. I don't think we should bundle all of those into one.
Whatever names we choose for those GUCs, I think they should have same
naming convention e.g. "partition_wise_xyz". I am open to suggestions
about the names.

I think the chances of you getting multiple GUCs for different
partition-wise optimizations past Tom are pretty low.

- Instead of reorganizing add_paths_to_append_rel as you did, could
you just add an RTE_JOIN case to the switch? Not sure if there's some
problem with that idea, but it seems like it might come out nicer.

RTE_JOIN is created only for joins specified using JOIN clause i.e
syntactic joins. The joins created during query planner like rel1,
rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there.

OK, never mind that then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#253Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Rafia Sabih (#244)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On TPC-H benchmarking of this patch, I found a regression in Q7. It
was taking some 1500s with the patch and some 900s without the patch.
Please find the attached pwd_reg.zip for the output of explain analyse
on head and with patch.

The experimental settings used were,
commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
patch-version = v26

Server settings:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
max_parallel_workers_per_gather = 4

Partitioning information:
Partitioning scheme = by range
Number of partitions in lineitem and orders table = 106
partition key for lineitem = l_orderkey
partition key for orders = o_orderkey

Apart from these there is a regression case on a custom table, on head
query completes in 20s and with this patch it takes 27s. Please find
the attached .out and .sql file for the output and schema for the test
case respectively. I have reported this case before (sometime around
March this year) as well, but I am not sure if it was overlooked or is
an unimportant and expected behaviour for some reason.

On completing the benchmark for all queries for the above mentioned
setup, following performance improvement can be seen,
Query | Patch | Head
3 | 1455 | 1631
4 | 499 | 4344
5 | 1464 | 1606
10 | 1475 | 1599
12 | 1465 | 1790

Note that all values of execution time are in seconds.
To summarise, apart from Q4, all other queries are showing somewhat
10-20% improvement. Though it is good but honestly, I expected more
from this feature atleast on this scale factor. I am yet to compare
these performances with the unpartitioned version of the database.

Please find attached file for the output of explain analyse for all
the queries on head and with patch.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

18sept.zipapplication/zip; name=18sept.zip
#254Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#252)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

partition pruning might need partexprs look up relevant quals, but
nullable_partexprs doesn't have any use there. So may be we should add
nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

+1.

Done.

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design. If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property. You could consider something like
"Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

Good point. Also, I'm not sure about cases like this:

SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;

Suppose the relations are all partitioned by (x, y) but that the =
operator is not strict. A partition-wise join is valid between a and
b, but we can't regard w as partitioned any more, because w.x might
contain nulls in partitions where the partitioning scheme wouldn't
allow them. On the other hand, if the subquery were to select a.x,
a.y then clearly it would be fine: there would be no possibility of a
NULL having been substituted for a proper value.

What if the subquery selected a.x, b.y? Initially, I thought that
would be OK too, because of the fact that the a.y = b.y clause is in
the WHERE clause rather than the join condition. But on further
thought I think that probably doesn't work, because with = being a
non-strict operator there's no guarantee that it would remove any
nulls introduced by the left join. Of course, if the subselect had a
WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
list mention those columns would be fine.

I am actually not sure whether we can use partition-wise join for a
LEFT JOIN b when the partition key equalities are spread across ON and
WHERE clauses. I am not able to find any example against it, but I am
not able to prove it as well. The reference I used for partition-wise
join [1]https://pdfs.semanticscholar.org/27c2/ba75f8b6a39d4bce85d5579dace609c9abaa.pdf -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company, mentions JOIN conditions i.e. ON clause conditions. But all
the examples used in that paper are that of INNER join. So, I am not
sure what exactly the authors meant by JOIN conditions. Right now I am
restricting the patch to work with only conditions in the ON clause.

Practically most of the operators are strict. OUTER join's WHERE
clause has any partition key equality with strict operator, optimizer
will turn
that OUTER join into an INNER one, turning all clauses into join
clauses. That will enable partition-wise join. So, the current
restriction doesn't restrict any practical cases.

OTOH, I have seen that treating ON and WHERE clauses as same for an
OUTER join leads to surprising results. So, I am leaning to treat them
separate for partition-wise join as well and only use ON clause
conditions for partition-wise join. If we get complaints about
partition-wise join not being picked we will fix them after proving
that it's not harmful. Lifting that restriction is not so difficult.
have_partition_key_equijoin() ignores "pushed down" quals. We have to
just change that condition.

Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL
is interesting. If those conditions are in ON clause, we may still
have a result where b.x and b.y as NULL when no row in "a" matches a
row in "b". If those conditions are in WHERE clause, I think optimizer
will turn the join into an INNER join irrespective of whether the
equality operator is strict.

If partition-wise join is disabled, partition-wise aggregates,
strength reduction of MergeAppend won't be possible on a join tree,
but those will be possible on a base relation. Even if partition-wise
join enabled, one may want to disable other partition-wise
optimizations individually. So, they are somewhat independent
switches. I don't think we should bundle all of those into one.
Whatever names we choose for those GUCs, I think they should have same
naming convention e.g. "partition_wise_xyz". I am open to suggestions
about the names.

I think the chances of you getting multiple GUCs for different
partition-wise optimizations past Tom are pretty low.

We do have enable_hashjoin and enable_hashagg to control use of
hashing for aggregate and join. On similar lines we can have three
GUCs to enable use of partition-wise strategy, one for each of join,
aggregation and sorting. Having granular switches would be useful for
debugging and may be to turn partition-wise strategies off when they
are not optimal. Do we want a switch to turn ON/OFF partition pruning?
Said, that I am fine with single GUC controlling all. We won't set any
partitioning information in RelOptInfo if that GUC is turned OFF.

[1]: https://pdfs.semanticscholar.org/27c2/ba75f8b6a39d4bce85d5579dace609c9abaa.pdf -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v33.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v33.tar.gz
#255Alvaro Herrera
Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Rafia Sabih (#253)
Re: Partition-wise join for join between (declaratively) partitioned tables

Rafia Sabih wrote:

On completing the benchmark for all queries for the above mentioned
setup, following performance improvement can be seen,
Query | Patch | Head
3 | 1455 | 1631
4 | 499 | 4344
5 | 1464 | 1606
10 | 1475 | 1599
12 | 1465 | 1790

Note that all values of execution time are in seconds.
To summarise, apart from Q4, all other queries are showing somewhat
10-20% improvement.

Saving 90% of time on the slowest query looks like a worthy improvement
on its own right. However, you're reporting execution time only, right?
What happens to planning time? In a quick look,

$ grep 'Planning time' pg_part_*/4*
pg_part_head/4_1.out: Planning time: 3390.699 ms
pg_part_head/4_2.out: Planning time: 194.211 ms
pg_part_head/4_3.out: Planning time: 210.964 ms
pg_part_head/4_4.out: Planning time: 4150.647 ms
pg_part_patch/4_1.out: Planning time: 7577.247 ms
pg_part_patch/4_2.out: Planning time: 312.421 ms
pg_part_patch/4_3.out: Planning time: 304.697 ms
pg_part_patch/4_4.out: Planning time: 269.778 ms

I think the noise in these few results is too large to draw any
conclusions. Maybe a few dozen runs of EXPLAIN (w/o ANALYZE) would tell
something significant?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#256Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Thomas Munro (#249)
2 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Sep 16, 2017 at 2:41 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Sep 16, 2017 at 9:38 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Sep 16, 2017 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On the overall patch set:

- I am curious to know how this has been tested. How much of the new
code is covered by the tests in 0007-Partition-wise-join-tests.patch?
How much does coverage improve with
0008-Extra-testcases-for-partition-wise-join-NOT-FOR-COMM.patch? What
code, if any, is not covered by either of those test suites? Could we
do meaningful testing of this with something like Andreas
Seltenreich's sqlsmith?

FWIW I'm working on an answer to both of those question, but keep
getting distracted by other things catching on fire...

I cobbled together some scripts to figure out the test coverage of
lines actually modified by this patch set. Please see attached.

I'm not sure if there is an established or better way to do this, but
I used git-blame to figure out which lines of gcov output can be
blamed on Ashutosh and prepended that to the lines of gcov's output.
That allowed me to find new/changed code not covered by "make check".
I found 94 untested new lines with 0007 applied and 88 untested new
lines with 0008 applied. The 6 lines that 0008 reaches and 0007
doesn't are:

======== src/backend/optimizer/path/allpaths.c ========
-[TOUCHED BY PATCH SET] #####: 3303: mark_dummy_rel(rel);
-[TOUCHED BY PATCH SET] #####: 3304: return;
-[TOUCHED BY PATCH SET] #####: 1515: continue;
-[TOUCHED BY PATCH SET] #####: 1526: continue;
======== src/backend/optimizer/util/pathnode.c ========
-[TOUCHED BY PATCH SET] #####: 3433: break;
-[TOUCHED BY PATCH SET] #####: 3435: return NULL;

Two obvious questions:

1. What are we missing in the ~90 lines of non-covered code, and are
there bugs lurking there?

First, here's an easier to read report than the one I posted earlier.
It's based on the whole patch stack (including the extra tests) from
your v33 tarball:

https://codecov.io/gh/postgresql-cfbot/postgresql/commit/19dace6fca0d9c2bca5022158cf28d99aa237550

The main areas of uncovered lines are: code in
get_wholerow_ref_from_convert_row_type() and code that calls it, and
per node type cases in reparameterize_path_by_child(). It seems like
the former could use a test case, and I wonder if there is some way we
could write "flat-copy and then apply recursively to all subpaths"
code like this without having to handle these cases explicitly. There
are a couple of other tiny return cases other than just sanity check
errors which it might be interesting to hit too.

2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit?

I thought about how to answer questions like this and came up with a
shell script that (1) makes computers run really hot for quite a long
time and (2) tells you which blocks of SQL hit which lines of C.
Please find attached the shell script and its output. The .sql files
have been annotated with "block" numbers (blocks being chunks of SQL
stuff separated by blank lines), and the C files annotated with
references to those block numbers where A<n> = block <n>
partition_join.sql and B<n> = block <n> in partition_join_extras.sql.

Then to find lines that B queries hit but A queries don't and know
which particular queries hit them, you might use something like:

grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
grep 'SQL blocks: .*B[0-9]'

(Off topic but by way of explanation: the attachment name ending
.tarball.gz avoids .tgz or .tar.gz so my stupid cfbot doesn't think
it's a new patch set. I need to figure something better out for
that...)

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

coverage.tarball.gzapplication/x-gzip; name=coverage.tarball.gz
blame_coverage_on_queries.shapplication/x-sh; name=blame_coverage_on_queries.sh
#257Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Alvaro Herrera (#255)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 19, 2017 at 3:50 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Rafia Sabih wrote:

On completing the benchmark for all queries for the above mentioned
setup, following performance improvement can be seen,
Query | Patch | Head
3 | 1455 | 1631
4 | 499 | 4344
5 | 1464 | 1606
10 | 1475 | 1599
12 | 1465 | 1790

Note that all values of execution time are in seconds.
To summarise, apart from Q4, all other queries are showing somewhat
10-20% improvement.

Saving 90% of time on the slowest query looks like a worthy improvement
on its own right. However, you're reporting execution time only, right?
What happens to planning time? In a quick look,

Definitely. The planning time issue has been discussed upthread,

On Mon, Mar 20, 2017 at 12:07 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Another minor thing to note that is planning time is almost twice with
this patch, though I understand that this is for scenarios with really
big 'big data' so this may not be a serious issue in such cases, but
it'd be good if we can keep an eye on this that it doesn't exceed the
computational bounds for a really large number of tables..

To which Robert replied as,

Yes, this is definitely going to use significant additional planning
time and memory. There are several possible strategies for improving
that situation, but I think we need to get the basics in place first.
That's why the proposal is now to have this turned off by default.
People joining really big tables that happen to be equipartitioned are
likely to want to turn it on, though, even before those optimizations
are done.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#258Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Thomas Munro (#256)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

The main areas of uncovered lines are: code in
get_wholerow_ref_from_convert_row_type() and code that calls it, and
per node type cases in reparameterize_path_by_child(). It seems like
the former could use a test case, and I wonder if there is some way we
could write "flat-copy and then apply recursively to all subpaths"
code like this without having to handle these cases explicitly. There
are a couple of other tiny return cases other than just sanity check
errors which it might be interesting to hit too.

Under the debugger I checked that the test in partition_join.sql
-- left outer join, with whole-row reference
EXPLAIN (COSTS OFF)
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;
SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b
= 0 ORDER BY t1.a, t2.b;
covers get_wholerow_ref_from_convert_row_type(). But it doesn't cover
a couple of lines in the case of nested ConvertRowTypeExpr in that
function. We can add/modify a testcase in multi-level partitioned
table section to cover that.

reparameterize_path_by_child() coverage is hard. It would require that
many different kinds of paths survive in lower joins in the join tree.
It's hard to come up with queries that would do that with limited
amount of data and a reasonable number of tests. Me and Thomas
discussed about his suggestion about "flat-copy and then apply
recursively to all subpaths" which he sees as a path tree mutator. It
won't improve the test coverage. Like expression_tree_mutator() path
mutation is not that widely used phenomenon, so we do not yet know
what should be the characteristics of a path mutator could be. In case
we see more of path mutation code in future, it's an idea worth
considering.

2. What queries in the 0008 patch are hitting lines that 0007 doesn't hit?

I thought about how to answer questions like this and came up with a
shell script that (1) makes computers run really hot for quite a long
time and (2) tells you which blocks of SQL hit which lines of C.
Please find attached the shell script and its output. The .sql files
have been annotated with "block" numbers (blocks being chunks of SQL
stuff separated by blank lines), and the C files annotated with
references to those block numbers where A<n> = block <n>
partition_join.sql and B<n> = block <n> in partition_join_extras.sql.

Then to find lines that B queries hit but A queries don't and know
which particular queries hit them, you might use something like:

grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
grep 'SQL blocks: .*B[0-9]'

Thanks for this. It generates a lot of output (970 lines over all the
coverage files). It will take some time for getting anything
meaningful out of this. May be there's some faster way by looking at
the lines that are covered by B but not A. BTW, I checked those lines
to see if there could be any bug there. But I don't see what could go
wrong with those lines.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#259Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#254)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design. If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property. You could consider something like
"Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

Good point. Also, I'm not sure about cases like this:

SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;

Suppose the relations are all partitioned by (x, y) but that the =
operator is not strict. A partition-wise join is valid between a and
b, but we can't regard w as partitioned any more, because w.x might
contain nulls in partitions where the partitioning scheme wouldn't
allow them. On the other hand, if the subquery were to select a.x,
a.y then clearly it would be fine: there would be no possibility of a
NULL having been substituted for a proper value.

What if the subquery selected a.x, b.y? Initially, I thought that
would be OK too, because of the fact that the a.y = b.y clause is in
the WHERE clause rather than the join condition. But on further
thought I think that probably doesn't work, because with = being a
non-strict operator there's no guarantee that it would remove any
nulls introduced by the left join. Of course, if the subselect had a
WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
list mention those columns would be fine.

In my previous reply to this, I probably didn't answer your question
while I explained the restriction on where equality conditions on
partition keys can appear. Here's answer to your questions assuming
those restrictions don't exist. Actually in the example you have
given, optimizer flattens w as a LJ b which kind of makes the
explanations below a bit complicated.

1. SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not between w
and c for the reasons you have explained above.
2. SELECT * FROM (SELECT a.x, a.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b and also between
w and c for the reasons you have explained above.
3. SELECT * FROM (SELECT a.x, b.y FROM a LEFT JOIN b ON a.x = b.x
WHERE a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;
partition-wise join will be possible between a and b but not w and c
as you have explained.

In this case b.x and b.y will appear as nullable_partexprs in w
(represented as a LJ b in optimizer) and a.x and a.y will appear in
partexprs. Depending upon what gets projected out of w, the join
between w and c will use corresponding keys for equality conditions.
Since the operator is non-strict, any expression which is part of
nullable_partexprs will be discarded in
match_expr_to_partition_keys().

Hope that helps.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#260Jeevan Chalke
Jeevan Chalke
jeevan.chalke@enterprisedb.com
In reply to: Ashutosh Bapat (#254)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 19, 2017 at 3:17 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Sep 19, 2017 at 2:35 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Mon, Sep 18, 2017 at 8:02 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

partition pruning might need partexprs look up relevant quals, but
nullable_partexprs doesn't have any use there. So may be we should add
nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

+1.

Done.

- I'm not entirely sure whether maintaining partexprs and
nullable_partexprs is the right design. If I understand correctly,
whether or not a partexpr is nullable is really a per-RTI property,
not a per-expression property. You could consider something like
"Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

Good point. Also, I'm not sure about cases like this:

SELECT * FROM (SELECT b.x, b.y FROM a LEFT JOIN b ON a.x = b.x WHERE
a.y = b.y) w LEFT JOIN c ON w.x = c.x AND w.y = c.y;

Suppose the relations are all partitioned by (x, y) but that the =
operator is not strict. A partition-wise join is valid between a and
b, but we can't regard w as partitioned any more, because w.x might
contain nulls in partitions where the partitioning scheme wouldn't
allow them. On the other hand, if the subquery were to select a.x,
a.y then clearly it would be fine: there would be no possibility of a
NULL having been substituted for a proper value.

What if the subquery selected a.x, b.y? Initially, I thought that
would be OK too, because of the fact that the a.y = b.y clause is in
the WHERE clause rather than the join condition. But on further
thought I think that probably doesn't work, because with = being a
non-strict operator there's no guarantee that it would remove any
nulls introduced by the left join. Of course, if the subselect had a
WHERE clause saying that b.x/b.y IS NOT NULL then having the SELECT
list mention those columns would be fine.

I am actually not sure whether we can use partition-wise join for a
LEFT JOIN b when the partition key equalities are spread across ON and
WHERE clauses. I am not able to find any example against it, but I am
not able to prove it as well. The reference I used for partition-wise
join [1], mentions JOIN conditions i.e. ON clause conditions. But all
the examples used in that paper are that of INNER join. So, I am not
sure what exactly the authors meant by JOIN conditions. Right now I am
restricting the patch to work with only conditions in the ON clause.

Practically most of the operators are strict. OUTER join's WHERE
clause has any partition key equality with strict operator, optimizer
will turn
that OUTER join into an INNER one, turning all clauses into join
clauses. That will enable partition-wise join. So, the current
restriction doesn't restrict any practical cases.

OTOH, I have seen that treating ON and WHERE clauses as same for an
OUTER join leads to surprising results. So, I am leaning to treat them
separate for partition-wise join as well and only use ON clause
conditions for partition-wise join. If we get complaints about
partition-wise join not being picked we will fix them after proving
that it's not harmful. Lifting that restriction is not so difficult.
have_partition_key_equijoin() ignores "pushed down" quals. We have to
just change that condition.

Your last sentence about a clause b.x IS NOT NULL or b.y IS NOT NULL
is interesting. If those conditions are in ON clause, we may still
have a result where b.x and b.y as NULL when no row in "a" matches a
row in "b". If those conditions are in WHERE clause, I think optimizer
will turn the join into an INNER join irrespective of whether the
equality operator is strict.

If partition-wise join is disabled, partition-wise aggregates,
strength reduction of MergeAppend won't be possible on a join tree,
but those will be possible on a base relation. Even if partition-wise
join enabled, one may want to disable other partition-wise
optimizations individually. So, they are somewhat independent
switches. I don't think we should bundle all of those into one.
Whatever names we choose for those GUCs, I think they should have same
naming convention e.g. "partition_wise_xyz". I am open to suggestions
about the names.

I think the chances of you getting multiple GUCs for different
partition-wise optimizations past Tom are pretty low.

We do have enable_hashjoin and enable_hashagg to control use of
hashing for aggregate and join. On similar lines we can have three
GUCs to enable use of partition-wise strategy, one for each of join,
aggregation and sorting. Having granular switches would be useful for
debugging and may be to turn partition-wise strategies off when they
are not optimal.

I think having a granular control over each of these optimization will be
handy for the DBAs too.

Do we want a switch to turn ON/OFF partition pruning?
Said, that I am fine with single GUC controlling all. We won't set any
partitioning information in RelOptInfo if that GUC is turned OFF.

[1] https://pdfs.semanticscholar.org/27c2/ba75f8b6a39d4bce85d5579dace609
c9abaa.pdf
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#261Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#258)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Sep 20, 2017 at 3:13 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Sep 20, 2017 at 9:44 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

2. What queries in the 0008 patch are hitting lines that 0007 doesn't

hit?

I thought about how to answer questions like this and came up with a
shell script that (1) makes computers run really hot for quite a long
time and (2) tells you which blocks of SQL hit which lines of C.
Please find attached the shell script and its output. The .sql files
have been annotated with "block" numbers (blocks being chunks of SQL
stuff separated by blank lines), and the C files annotated with
references to those block numbers where A<n> = block <n>
partition_join.sql and B<n> = block <n> in partition_join_extras.sql.

Then to find lines that B queries hit but A queries don't and know
which particular queries hit them, you might use something like:

grep -v 'SQL blocks: .*A[0-9]' < joinpath.c.aggregated_coverage | \
grep 'SQL blocks: .*B[0-9]'

Thanks for this. It generates a lot of output (970 lines over all the
coverage files). It will take some time for getting anything
meaningful out of this. May be there's some faster way by looking at
the lines that are covered by B but not A. BTW, I checked those lines
to see if there could be any bug there. But I don't see what could go
wrong with those lines.

I have also tried to find test cases in B which hits some extra line which

is not
hitting by A, with the help of results attached by Thomas in
coverage.tarball_FILES.
It took lot of time but I am able to find some test cases. which if adding
in partition_join.sql
increasing no of lines hit by 14. but for hitting these 14 extra line
attached patch is doing
900+ line inserts in partition_join.sql and partition_join.out file.

I have used gcov-lcov to find coverage for files changed by
partition-wise-join patches
with and without attached patch which is below.

*with existing partition_join.sql* *partition_join.sql + some test cases of
partition_join_extra.sql*
*Modifed Files* *Line Coverage* *Functions* *Line Coverage* *Functions*
src/backend/optimizer/geqo 79.4 % 269/339 96.6 % 28/29 79.4 % 269/339 96.6 %
28/29
src/backend/optimizer/path/allpaths.c 92.3 % 787 / 853 95.5 % 42 / 44
92.6 % 790
/ 853 95.5 % 42 / 44
src/backend/optimizer/path/costsize.c 96.8 % 1415 / 1462 98.4 % 61 / 62
96.9 % 1416 / 1462 98.4 % 61 / 62
src/backend/optimizer/path/joinpath.c 95.5 % 404 / 423 100.0 % 16 / 16
95.5 % 404 / 423 100.0 % 16 / 16
src/backend/optimizer/path/joinrels.c 92.5 % 422 / 456 100.0 % 16 / 16
93.0 % 424 / 456 100.0 % 16 / 16
src/backend/optimizer/plan/createplan.c 90.9 % 1928 / 2122 96.3 % 103 / 107
91.0 % 1930 / 2122 96.3 % 103 / 107
src/backend/optimizer/plan/planner.c 94.9 % 1609 / 1696 97.6 % 41 / 42
94.9 % 1609 / 1696 97.6 % 41 / 42
src/backend/optimizer/plan/setrefs.c 91.3 % 806 / 883 94.3 % 33 / 35 91.3 % 806
/ 883 94.3 % 33 / 35
src/backend/optimizer/prep/prepunion.c 95.5 % 661 / 692 100.0 % 25 / 25
95.5 % 661 / 692 100.0 % 25 / 25
src/backend/optimizer/util/pathnode.c 88.7 % 1144 / 1290 98.1 % 52 / 53
88.8 % 1146 / 1290 98.1 % 52 / 53
src/backend/optimizer/util/placeholder.c 96.5 % 139 / 144 100.0 % 10 / 10
96.5 % 139 / 144 100.0 % 10 / 10
src/backend/optimizer/util/plancat.c 89.0 % 540 / 607 94.7 % 18 / 19 89.6 % 544
/ 607 94.7 % 18 / 19
src/backend/optimizer/util/relnode.c 95.3 % 548 / 575 100.0 % 24 / 24
95.3 % 548
/ 575 100.0 % 24 / 24
src/backend/utils/misc/guc.c 67.4 % 1536 / 2278 89.7 % 113 / 126 67.4 % 1536
/ 2278 89.7 % 113 / 126

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

partition_join_with_some_testcases_from_extra.patchtext/x-patch; charset=US-ASCII; name=partition_join_with_some_testcases_from_extra.patch
#262Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#254)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Done.

Committed 0001 with extensive editorialization. I did not think it
was a good idea to include a partition.h a file in src/include/nodes,
so I worked around that. The include of pg_inherits_fn.h was
unneeded. I rewrote a lot of the comments and made some other style
tweaks.

Don't look now, but I think it might be about time for the main act.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#263Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#262)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 21, 2017 at 9:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 19, 2017 at 5:47 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Done.

Committed 0001 with extensive editorialization. I did not think it
was a good idea to include a partition.h a file in src/include/nodes,
so I worked around that. The include of pg_inherits_fn.h was
unneeded. I rewrote a lot of the comments and made some other style
tweaks.

Thanks a lot Robert. Thanks for changing comments to be more precise and crisp.

Here's set of rebased patches. The patch with extra tests is not for
committing. All other patches, except the last one, will need to be
committed together. The last patch may be committed along with other
patches or as a separate patch.

About your earlier comment of making build_joinrel_partition_info()
simpler. Right now, the code assumes that partexprs or
nullable_partexpr can be NULL when either of them is not populated.
That may be saves a sizeof(pointer) * (number of keys) byes of memory.
Saving that much memory may not be worth the complexity of code. So,
we may always allocate memory for those arrays and fill it with NIL
values when there are no key expressions to populate those. That will
simplify the code. I haven't done that change in this patchset. I was
busy debugging the Q7 regression. Let me know your comments about
that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v34.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v34.tar.gz
#264Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#263)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

About your earlier comment of making build_joinrel_partition_info()
simpler. Right now, the code assumes that partexprs or
nullable_partexpr can be NULL when either of them is not populated.
That may be saves a sizeof(pointer) * (number of keys) byes of memory.
Saving that much memory may not be worth the complexity of code. So,
we may always allocate memory for those arrays and fill it with NIL
values when there are no key expressions to populate those. That will
simplify the code. I haven't done that change in this patchset. I was
busy debugging the Q7 regression. Let me know your comments about
that.

Hmm, I'm not sure that's the best approach, but let me look at it more
carefully before I express a firm opinion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#265Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#250)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Sep 18, 2017 at 10:18 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

Limit (cost=83341943.28..83341943.35 rows=1 width=92) (actual
time=1556989.996..1556989.997 rows=1 loops=1)
-> Finalize GroupAggregate (cost=83341943.28..83342723.24
rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1
loops=1)
Group Key: n1.n_name, n2.n_name, (date_part('year'::text,
(lineitem_001.l_shipdate)::timestamp without time zone))
-> Sort (cost=83341943.28..83342043.92 rows=40256 width=92)
(actual time=1556989.910..1556989.911 rows=6 loops=1)
Sort Key: n1.n_name, n2.n_name,
(date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
time zone))
Sort Method: quicksort Memory: 27kB
-> Gather (cost=83326804.81..83338864.31 rows=40256
width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1)
Workers Planned: 4
Workers Launched: 4

AFAICU the node above sort is group-aggregate and then there is limit,
and the number of rows for sort node in explain analyse is returned
number of rows. So, what is happening here is once one group is
completed it is aggregated and fetched by limit, now there is no need
for sort to return any more rows and hence the result.

Thanks for your explanation. That makes sense. I forgot about LIMIT node on top.

I debugged the plans today and performed some experiments. Here are my
observations

The join order with and without partition-wise join changes. Without
partition-wise join it is
(lineitem, (suppliers, nation1)), (orders, (customer, nation2)). The
join (lineitem, (suppliers, nation1)) is executed by one gather node
and (orders, (customer, nation2)) is executed by other. Thus the plan
has two gather nodes, which feed to the topmost join.
With partition-wise join the join order is ((lineitem, orders),
(supplier, nation1)), (customer, nation2). The join (lineitem, orders)
uses partition-wise join. This plan executes the whole join tree along
with partial group aggregation under a gather merge.

The rows estimated for various nodes under Gather/GatherMerge are
different from the actual rows e.g.
-> Hash Join (cost=113164.47..61031454.40 rows=10789501 width=46)
(actual time=3379.931..731987.943 rows=8744357 loops=5) (in
non-partition-wise join plan) OR
-> Append (cost=179532.36..80681785.95 rows=134868761 width=24)
(actual time=9437.573..1360219.567 rows=109372134 loops=5) (in
partition-wise join plan).
I first thought that this is a real estimation error and spent some
time investigating the estimation error. But eventually realised that
this is how a parallel query plan reports, when I saw that Gather node
estimated correct number of rows even though the nodes under it showed
this difference. Here's the explanation of this report. There are 4
parallel workers, so, the leaders contribution would be estimated to
be 0 by get_parallel_divisor(). So these estimates are per worker and
so the total estimated rows produced by any of the nodes is 4 times
the reported. But when the query actually runs, the leader also
participates, so number of loops = 5 and the actual rows reported are
(total actual rows) / (number of loops i.e. number of backends that
executed the query). The total estimates rows and total actual rows
are roughly equal. So there's no real estimation error, as I thought
earlier. May be we want to make EXPLAIN (ANALYZE) output easier to
understand.

When I tried the same query on laptop with scale 20, I found that the
leader is really contributing as much as other workers. So, the
partial paths were really created based on an estimate which was 20%
off. The cost difference between partition-wise join plan and
non-partition-wise join plan is hardly 1.5%. So, it's possible that if
we correct this estimation error, partition-wise join plan won't be
chosen because of it will have a higher cost. Remember there are two
gather nodes in non-partition-wise join plan and partition-wise join
plan has one gather. So, non-partition-wise join path gets the 20%
decreased estimates twice and partition-wise join gets it only once.

The explain (analyze, verbose) of a parallel node looks like
-> Parallel Seq Scan on public.lineitem_002 (cost=0.00..168752.99
rows=573464 width=24) (actual time=1.395..3075.485 rows=454464
loops=5)
Filter:
((lineitem_002.l_shipdate >= '1995-01-01'::date) AND
(lineitem_002.l_shipdate <= '1996-12-31'::date))
Rows Removed by Filter: 1045065
Worker 0: actual
time=3.358..3131.426 rows=458267 loops=1
Worker 1: actual
time=0.860..3146.282 rows=447231 loops=1
Worker 2: actual
time=1.317..3123.646 rows=489960 loops=1
Worker 3: actual
time=0.927..3130.497 rows=475545 loops=1
If we sum the rows returned by each worker they don't add up to
(actual rows) * (actual loops). So I assumed that the unreported
number of rows were processed by the leader. Is that right?

I might be misunderstanding how parallel query works, but here's my
analysis so far. I will continue investigating further.

Any clues would be helpful.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#266Rafia Sabih
Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Rafia Sabih (#253)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Sep 19, 2017 at 2:58 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On Fri, Sep 15, 2017 at 2:09 PM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On TPC-H benchmarking of this patch, I found a regression in Q7. It
was taking some 1500s with the patch and some 900s without the patch.
Please find the attached pwd_reg.zip for the output of explain analyse
on head and with patch.

The experimental settings used were,
commit-id = 0c504a80cf2e6f66df2cdea563e879bf4abd1629
patch-version = v26

Server settings:
work_mem = 1GB
shared_buffers = 10GB
effective_cache_size = 10GB
max_parallel_workers_per_gather = 4

Partitioning information:
Partitioning scheme = by range
Number of partitions in lineitem and orders table = 106
partition key for lineitem = l_orderkey
partition key for orders = o_orderkey

Apart from these there is a regression case on a custom table, on head
query completes in 20s and with this patch it takes 27s. Please find
the attached .out and .sql file for the output and schema for the test
case respectively. I have reported this case before (sometime around
March this year) as well, but I am not sure if it was overlooked or is
an unimportant and expected behaviour for some reason.

On completing the benchmark for all queries for the above mentioned
setup, following performance improvement can be seen,
Query | Patch | Head
3 | 1455 | 1631
4 | 499 | 4344
5 | 1464 | 1606
10 | 1475 | 1599
12 | 1465 | 1790

Note that all values of execution time are in seconds.

I compared this experiment with non-partitioned database and following
is the result,
Query | Non-partitioned head
3 | 1752
4 | 315
5 | 2319
10 | 1535
12 | 1739

In summary, the query that appears slowest in partitioned database is
not so otherwise. It is good to see that in Q4 partition-wise join
helps in achieving performance closer to it's non-partitioned case,
otherwise partitioning alone causes it to suffer greatly. Apart from
Q4 it does not looks like partitioning hurts anywhere else, though the
maximum improvement is ~35% for Q5.
Another point to note here is that the performance on partitioned and
unpartitioned heads are quite close (except Q4) which is something
atleast I wasn't expecting. It looks like we need not to partition the
tables anyway, or atleast this set of queries doesn't benefit from
partitioning. Please let me know if somebody has better ideas on how
partitioning schemes should be applied to make it more beneficial for
these queries.

--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

pg_unpart.zipapplication/zip; name=pg_unpart.zip
#267Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rafia Sabih (#266)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 22, 2017 at 10:45 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:

On completing the benchmark for all queries for the above mentioned
setup, following performance improvement can be seen,
Query | Patch | Head
3 | 1455 | 1631
4 | 499 | 4344
5 | 1464 | 1606
10 | 1475 | 1599
12 | 1465 | 1790

Note that all values of execution time are in seconds.

I compared this experiment with non-partitioned database and following
is the result,
Query | Non-partitioned head
3 | 1752
4 | 315
5 | 2319
10 | 1535
12 | 1739

In summary, the query that appears slowest in partitioned database is
not so otherwise. It is good to see that in Q4 partition-wise join
helps in achieving performance closer to it's non-partitioned case,
otherwise partitioning alone causes it to suffer greatly. Apart from
Q4 it does not looks like partitioning hurts anywhere else, though the
maximum improvement is ~35% for Q5.
Another point to note here is that the performance on partitioned and
unpartitioned heads are quite close (except Q4) which is something
atleast I wasn't expecting. It looks like we need not to partition the
tables anyway, or atleast this set of queries doesn't benefit from
partitioning. Please let me know if somebody has better ideas on how
partitioning schemes should be applied to make it more beneficial for
these queries.

Just partitioning is not expected to improve query performance (but we
still see some performance improvement). Partitioning + partition-wise
operations, pruning is expected to show performance gains. IIUC the
results you reported, Q3 takes 1752 seconds with non-partitioned head,
with partitioning it completes in 1631 seconds and with partition-wise
join it completes in 1455, so net improvement because of partitioning
is 300 seconds is almost 16% improvement, which is a lot for very
large data. So, except Q4, every query improves when the tables are
partitioned. Am I interpreting the results correctly?

There may be some other way of partitioning, which may give better
results, but I think what we have now shows the importance of
partitioning in case of very large data e.g. scale 300 TPCH.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#268Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#246)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Sep 15, 2017 at 5:29 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Apart from these there is a regression case on a custom table, on head
query completes in 20s and with this patch it takes 27s. Please find
the attached .out and .sql file for the output and schema for the test
case respectively. I have reported this case before (sometime around
March this year) as well, but I am not sure if it was overlooked or is
an unimportant and expected behaviour for some reason.

Are you talking about [1]? I have explained about the regression in
[2] and [3]. This looks like an issue with the existing costing model.

I debugged this case further. There are two partitioned tables being
joined prt (with partitions prt_p1, prt_p2 and so on) and prt2 (with
partitions prt2_p1, prt2_p2, and so on). When join is executed without
partition-wise join, prt2 is used to build hash table and prt is used
to probe that hash table. prt2 has lesser number of rows than prt. But
when partition-wise join is used, individual partitions are joined in
reverse join order i.e. partitions of prt are used to build the hash
table and partitions of prt2 are used to probe. This happens because
the path for the other join order (partition of prt2 used to build the
hash table and partition of prt used to probe) has huge cost compared
to the first one (74459 and 313109) and a portion worth 259094 comes
from lines 3226/7 of final_cost_hashjoin()
3215 /*
3216 * The number of tuple comparisons needed is the number of outer
3217 * tuples times the typical number of tuples in a hash
bucket, which
3218 * is the inner relation size times its bucketsize
fraction. At each
3219 * one, we need to evaluate the hashjoin quals. But actually,
3220 * charging the full qual eval cost at each tuple is pessimistic,
3221 * since we don't evaluate the quals unless the hash values match
3222 * exactly. For lack of a better idea, halve the cost estimate to
3223 * allow for that.
3224 */
3225 startup_cost += hash_qual_cost.startup;
3226 run_cost += hash_qual_cost.per_tuple * outer_path_rows *
3227 clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

That's because for some reason innerbucketsize for partition of prt is
22 times more than that for partition of prt2. Looks like we have some
estimation error in estimating bucket sizes.

If I force partitions to be joined with the same order as partitioned
tables (without partition-wise join), child-joins execute faster and
in turn partition-wise join performs better than the
non-partition-wise join. So, this is clearly some estimation and
costing problem with regular joins.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#269Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#263)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's set of rebased patches. The patch with extra tests is not for
committing. All other patches, except the last one, will need to be
committed together. The last patch may be committed along with other
patches or as a separate patch.

In set_append_rel_size, is it necessary to set attr_needed =
bms_copy(rel->attr_needed[index]) rather than just pointing to the
existing value? If so, perhaps the comments should explain the
reasons. I would have thought that the values wouldn't change after
this point, in which case it might not be necessary to copy them.

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

The elog(DEBUG3) in try_partition_wise_join() doesn't follow message
style guidelines and I think should just be removed. It was useful
for development, I'm sure, but it's time for it to go.

+ elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path));

I think we should use the same formulation as elsewhere, namely
"unrecognized node type: %d". And likewise probably "unexpected join
type: %d".

partition_join_extras.sql has a bunch of whitespace damage, although
it doesn't really matter since, as you say, that's not for commit.

(This is not a full review, just a few thoughts.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#270Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#269)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Oct 3, 2017 at 7:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's set of rebased patches. The patch with extra tests is not for
committing. All other patches, except the last one, will need to be
committed together. The last patch may be committed along with other
patches or as a separate patch.

In set_append_rel_size, is it necessary to set attr_needed =
bms_copy(rel->attr_needed[index]) rather than just pointing to the
existing value? If so, perhaps the comments should explain the
reasons. I would have thought that the values wouldn't change after
this point, in which case it might not be necessary to copy them.

Right. The only places where attr_needed is changed is in
remove_rel_from_query() (useless join removal) and
add_vars_to_targetlist(). Both of those happen before
set_append_rel_size(). Since parent and child join should project same
attributes, having them share the Relids set makes more sense. So,
changed accordingly and explained the same in comments.

Also, changed list_nth() in the following code block to use list_nth_node().

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

Partition-wise join is standard term used in literature and in
documentation of other popular DBMSes, so partition_wise makes more
sense. But I am fine with partition_join as well. Do you want it
partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
etc.?

The elog(DEBUG3) in try_partition_wise_join() doesn't follow message
style guidelines and I think should just be removed. It was useful
for development, I'm sure, but it's time for it to go.

Done.

+ elog(ERROR, "unrecognized path node type %d", (int) nodeTag(path));

I think we should use the same formulation as elsewhere, namely
"unrecognized node type: %d". And likewise probably "unexpected join
type: %d".

Changed "unrecognized path node type" to "unrecognized node type".

"unrecognized join type: %d" seems to be used everywhere except
postgres_fdw. So, used that. Also added a cast to int similar to other
places.

partition_join_extras.sql has a bunch of whitespace damage, although
it doesn't really matter since, as you say, that's not for commit.

Right. I will remove that patch from the patch-set when those tests
are no more needed i.e. once we are done with code changes to the
patches.

Attached the updated patch-set.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v35.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v35.tar.gz
#271Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#270)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

Partition-wise join is standard term used in literature and in
documentation of other popular DBMSes, so partition_wise makes more
sense. But I am fine with partition_join as well. Do you want it
partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
etc.?

Well, you're making me have second thoughts. It's really just that
partition_wise looks a little awkward to me, and maybe that's not
enough reason to change anything. I suppose if I commit it this way
and somebody really hates it, it can always be changed later. We're
not getting a lot of input from anyone else at the moment.

Attached the updated patch-set.

I decided to skip over 0001 for today and spend some time looking at
0002-0006. Comments below.

0002:

Looks fine.

0003:

The commit message mentions estimate_num_groups but the patch doesn't touch it.

I am concerned that this patch might introduce some problem fixed by
commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that
patch say, at one place, that "This protects against possible
incorrect matches to child expressions that contain no Vars."
However, if a child expression has no Vars, then I think em->em_relids
will be empty, so the bms_is_equal() test that is there now will fail
but your proposed bms_is_subset() test will pass.

0004:

I suggest renaming get_wholerow_ref_from_convert_row_type to
is_converted_whole_row_reference and making it return a bool.

The coding of that function is a little strange; why not move Var to
an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var =
castNode(Var, convexpr->arg; if (var->varattno == 0) return var; }

Will the statement that "In case of multi-level partitioning, we will
have as many nested ConvertRowtypeExpr as there are levels in
partition hierarchy" be falsified by Amit Khandekar's pending patch to
avoid sticking a ConvertRowTypeExpr on top of another
ConvertRowTypeExpr? Even if the answer is "no", I think it might be
better to drop this part of the comment; it would be easy for it to
become false in the future, because we might want to optimize that
case in the future and we'll probably forget to update this comment
when we do.

In fix_upper_expr_mutator(), you have an if statement whose entire
contents are another if statement. I think you should use && instead,
and maybe reverse the order of the tests, since
context->subplan_itlist->has_conv_whole_rows is probably cheaper to
test than a function call. It's also a little strange that this code
isn't adjacent too, or merged with, the existing has_non_vars case.
Maybe:

converted_whole_row = is_converted_whole_row_reference(node);
if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
(context->outer_itlist->has_conv_whole_rows && converted_whole_row))
...
if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
(context->inner_itlist->has_conv_whole_rows && converted_whole_row))
...

0005:

The comment explaining why the ParamPathInfo is allocated in the same
context as the RelOptInfo is a modified copy of an existing comment
that still reads like the original, a manner of commenting I find a
bit undesirable as it leads to filling up the source base with
duplicate comments.

I don't think I believe that comment, either. In the case from which
that comment was copied (mark_dummy_rel), it was talking about a
RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
to joinrels creating during a GEQO cycle. But there's no similar
logic for ppilist, so I think what will happen here is that you'll end
up with a freed node in the middle of the list.

I think reparameterize_path_by_chid() could use a helper function
reparameterize_pathlist_by_child() that iterates over a list of paths
and returns a list of paths. That would remove some of the loops.

I think the comments for reparameterize_path_by_child() need to be
expanded. They don't explain how you decided which nodes need to be
handled here or which fields within those nodes need some kind of
handling other than a flat-copy. I think these kinds of explanations
will be important for future maintenance of this code. You know why
you did it this way, I can mostly guess what you did it this way, but
what about the next person who comes along who hasn't made a detailed
study of partition-wise join?

I don't see much point in the T_SubqueryScanPath and T_ResultPath
cases in reparameterize_path_by_child(). It's just falling through to
the default case.

I wonder if reparameterize_path_by_child() ought to default to
returning NULL rather than throwing an error; the caller would then
have to be prepared for that and skip building the path. But that
would be more like what reparameterize_path() does, and it would make
failure to include some relevant path type here a corner-case
performance bug rather than a correctness issue. It seems like
someone adding a new path type could quite easily fail to realize that
it might need to be added here, or might be unsure whether it's
necessary to add it here.

0006:

I have some doubts about how stable all of the EXPLAIN outputs are
going to be on the buildfarm. I'm not sure what we can really do
about that in advance of trying them, but it's a lot of EXPLAIN
output. If you have an ideas about how to tighten it up without
losing test coverage, that would be good. For example, maybe the
"full outer join" case isn't needed given the following test case
which is also a full outer join but which covers additional behavior.

I think it would be good to have a test case that shows multi-level
partition-wise join working across multiple levels. I wrote the
attached test, which you're welcome to use if you like it, adapt if
you sorta like it, or replace if you dislike it. The table names at
least should be changed to something less likely to duplicate other
tests.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

mlpartjoin.sqlapplication/octet-stream; name=mlpartjoin.sql
#272Amit Langote
Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Robert Haas (#271)
Re: Partition-wise join for join between (declaratively) partitioned tables

On 2017/10/04 4:27, Robert Haas wrote:

On Tue, Oct 3, 2017 at 8:57 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

Partition-wise join is standard term used in literature and in
documentation of other popular DBMSes, so partition_wise makes more
sense. But I am fine with partition_join as well. Do you want it
partition_join or partitionjoin like enable_mergejoin/enable_hashjoin
etc.?

Well, you're making me have second thoughts. It's really just that
partition_wise looks a little awkward to me, and maybe that's not
enough reason to change anything. I suppose if I commit it this way
and somebody really hates it, it can always be changed later. We're
not getting a lot of input from anyone else at the moment.

FWIW, the name enable_partition_join seems enough to convey the core
feature, that is, I see "_wise" as redundant, even though I'm now quite
used to seeing "_wise" in the emails here and saying it out loud every now
and then. Ashutosh may have a point though that users coming from other
databases might miss the "_wise". :)

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#273Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#271)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Oct 4, 2017 at 12:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

0003:

The commit message mentions estimate_num_groups but the patch doesn't touch it.

This was fixed when we converted many rel->reloptkind ==
RELOPT_BASEREL to IS_SIMPLE_REL(). I have removed this section from
the commit message.

I am concerned that this patch might introduce some problem fixed by
commit dd4134ea56cb8855aad3988febc45eca28851cd8. The comment in that
patch say, at one place, that "This protects against possible
incorrect matches to child expressions that contain no Vars."
However, if a child expression has no Vars, then I think em->em_relids
will be empty, so the bms_is_equal() test that is there now will fail
but your proposed bms_is_subset() test will pass.

bms_is_equal() was enough when there was only a single member in
relids but it doesn't work now that there can be multiple of them.
bms_is_equal() was replaced with bms_is_subset() to accomodate for
ec_members with only a subset of relids when we are searching for a
join relation.

I am not sure whether your assumption that expression with no Vars
would have em_relids empty is correct. I wonder whether we will add
any em_is_child members with empty em_relids; looking at
process_equivalence() those come from RestrictInfo::left/right_relids
which just indicates the relids at which that particular expression
can be evaluated. Place holder vars is an example when that can
happen, but there may be others. To verify this, I tried attached
patch on master and ran make check. The assertion didn't trip. If
em_relids is not NULL, bms_is_subset() is fine.

If em_relids could indeed go NULL when em_is_child is true, passing
NULL relids (for parent rels) to that function can cause unwanted
behaviour. bms_is_equal(em->em_relids, relids) will return true
turning the if (em->em_is_child && !bms_is_equal(em->em_relids,
relids)) to false. This means that we will consider a child member
with em_relids NULL even while matching a parent relation. What
surprises me is, that commit added a bunch of testcases and none of
them failed with this change.

Nonetheless, I have changed "matches" with "belongs to" in the
prologue of those functions since an exact match won't be possible
with child-joins.

0004:

I suggest renaming get_wholerow_ref_from_convert_row_type to
is_converted_whole_row_reference and making it return a bool.

Done.

The coding of that function is a little strange; why not move Var to
an inner scope? Like this: if (IsA(convexpr->arg, var)) { Var *var =
castNode(Var, convexpr->arg; if (var->varattno == 0) return var; }

I probably went too far to avoid indented code :). Fixed now.

Will the statement that "In case of multi-level partitioning, we will
have as many nested ConvertRowtypeExpr as there are levels in
partition hierarchy" be falsified by Amit Khandekar's pending patch to
avoid sticking a ConvertRowTypeExpr on top of another
ConvertRowTypeExpr? Even if the answer is "no", I think it might be
better to drop this part of the comment; it would be easy for it to
become false in the future, because we might want to optimize that
case in the future and we'll probably forget to update this comment
when we do.

That might keep someone wondering where the nested
ConvertRowtypeExpr's came from. But may be in future those can arise
from something other than multi-level partition hierarchy and in that
case too the comment would be rendered inaccurate. So done.

In fix_upper_expr_mutator(), you have an if statement whose entire
contents are another if statement. I think you should use && instead,
and maybe reverse the order of the tests, since
context->subplan_itlist->has_conv_whole_rows is probably cheaper to
test than a function call. It's also a little strange that this code
isn't adjacent too, or merged with, the existing has_non_vars case.
Maybe:

converted_whole_row = is_converted_whole_row_reference(node);
if (context->outer_itlist && (context->outer_itlist->has_non_vars ||
(context->outer_itlist->has_conv_whole_rows && converted_whole_row))
...
if (context->inner_itlist && (context->inner_itlist->has_non_vars ||
(context->inner_itlist->has_conv_whole_rows && converted_whole_row))

I placed it with the other node types since it's for a specific node
type, but I guess your suggestion avoids duplicates and looks better.
Done.

...

0005:

The comment explaining why the ParamPathInfo is allocated in the same
context as the RelOptInfo is a modified copy of an existing comment
that still reads like the original, a manner of commenting I find a
bit undesirable as it leads to filling up the source base with
duplicate comments.

I have pointed to mark_dummy_rel() in that comment instead of
duplicating the whole paragraph.

I don't think I believe that comment, either. In the case from which
that comment was copied (mark_dummy_rel), it was talking about a
RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
to joinrels creating during a GEQO cycle. But there's no similar
logic for ppilist, so I think what will happen here is that you'll end
up with a freed node in the middle of the list.

In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist
with dummy path being created in the same context as the RelOptInfo.
Same applies here. While reparameterizing a path tree, we may reach a
path for a base relation and create a PPI for a base relation. This
may happen when GEQO is planning a join, and thus we are in a
short-lived context created by that GEQO cycle. We don't want a base
rel PPI to be created in that context, so instead we use the context
of base rel itself. Other way round, we don't want to use a longer
context for creating PPI for a join relation when it's created by a
GEQO cycle. So, we use join relation's context.The code doesn't free
up a node in the middle of the list but it avoids such an anomaly. See
[1]: /messages/by-id/CAFjFpRcPutbr4nVAsrY-5q=wCFrNK25_3MNhHgyYYM0yeOoj=Q@mail.gmail.com

I think reparameterize_path_by_chid() could use a helper function
reparameterize_pathlist_by_child() that iterates over a list of paths
and returns a list of paths. That would remove some of the loops.

That's a good idea. Done.

I think the comments for reparameterize_path_by_child() need to be
expanded. They don't explain how you decided which nodes need to be
handled here or which fields within those nodes need some kind of
handling other than a flat-copy. I think these kinds of explanations
will be important for future maintenance of this code. You know why
you did it this way, I can mostly guess what you did it this way, but
what about the next person who comes along who hasn't made a detailed
study of partition-wise join?

We need to reparameterize any path which contains further paths and/or
contains expressions that point to the parent relation. For a given
path we need to reparameterize any paths that it contains and
translate any expressions that are specific to that path. Expressions
common across the paths are translated after the switch case. I have
added this rule to the comment just above the switch case
/*
* Copy of the given path. Reparameterize any paths referenced by the given
* path. Replace parent Vars in path specific expressions by corresponding
* child Vars.
*/
Does that look fine or we want to add explanation for every node handled here.

I don't see much point in the T_SubqueryScanPath and T_ResultPath
cases in reparameterize_path_by_child(). It's just falling through to
the default case.

I added those cases separately to explain why we should not see those
cases in that switch case. I think that explanation is important
(esp. considering your comment above) and associating those comment
with "case" statement looks better. Are you suggesting that we should
add that explanation in default case?

I wonder if reparameterize_path_by_child() ought to default to
returning NULL rather than throwing an error; the caller would then
have to be prepared for that and skip building the path. But that
would be more like what reparameterize_path() does, and it would make
failure to include some relevant path type here a corner-case
performance bug rather than a correctness issue. It seems like
someone adding a new path type could quite easily fail to realize that
it might need to be added here, or might be unsure whether it's
necessary to add it here.

I am OK with that. However reparameterize_path_by_child() and
reparameterize_paths_by_child() are callers of
reparameterize_path_by_child() so they will need to deal with NULL
return. I am fine with that too, but making sure that we are on the
same page. If we do that, we could simply assert that the switch case
doesn't see T_SubqueryScanPath and T_ResultPath.

0006:

I have some doubts about how stable all of the EXPLAIN outputs are
going to be on the buildfarm. I'm not sure what we can really do
about that in advance of trying them, but it's a lot of EXPLAIN
output. If you have an ideas about how to tighten it up without
losing test coverage, that would be good. For example, maybe the
"full outer join" case isn't needed given the following test case
which is also a full outer join but which covers additional behavior.

Yes, I too am thinking about the same. The only reason I have EXPLAIN
output there is to check whether partition-wise join is being used or
not. The testcase is not interested in the actual shape. It doesn't
make sense to just test the output if partition-wise join is not used.
May be a function examining the plan tree would help. The function
will have to handle Result/Sort nodes on top and make sure that Append
has join children. Do you have any other idea to check the shape of
the plan tree without the details? Any EXPLAIN switch, existing
functions etc.?

Removed the extra full outer join testcase.

I think it would be good to have a test case that shows multi-level
partition-wise join working across multiple levels. I wrote the
attached test, which you're welcome to use if you like it, adapt if
you sorta like it, or replace if you dislike it. The table names at
least should be changed to something less likely to duplicate other
tests.

There are tests for multi-level partitioned table in the file. They
test whole partition hierarchy join, part of it being joined based on
the quals. Search for
--
-- multi-leveled partitions
--

Have you looked at those? They test two-level partitioned tables and
your test tests three-level partitioned table. I can modify the tests
to have three levels of partitions and different partition schemes on
different levels. Is that what you expect?

[1]: /messages/by-id/CAFjFpRcPutbr4nVAsrY-5q=wCFrNK25_3MNhHgyYYM0yeOoj=Q@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

em_is_child_em_relids.patchtext/x-patch; charset=US-ASCII; name=em_is_child_em_relids.patch
#274Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#264)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

About your earlier comment of making build_joinrel_partition_info()
simpler. Right now, the code assumes that partexprs or
nullable_partexpr can be NULL when either of them is not populated.
That may be saves a sizeof(pointer) * (number of keys) byes of memory.
Saving that much memory may not be worth the complexity of code. So,
we may always allocate memory for those arrays and fill it with NIL
values when there are no key expressions to populate those. That will
simplify the code. I haven't done that change in this patchset. I was
busy debugging the Q7 regression. Let me know your comments about
that.

Hmm, I'm not sure that's the best approach, but let me look at it more
carefully before I express a firm opinion.

Having studied this a bit more, I now think your proposed approach is
a good idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#275Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#271)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I decided to skip over 0001 for today and spend some time looking at
0002-0006.

Back to 0001.

+        Enables or disables the query planner's use of partition-wise join
+        plans. When enabled, it spends time in creating paths for joins between
+        partitions and consumes memory to construct expression nodes to be used
+        for those joins, even if partition-wise join does not result in the
+        cheapest path. The time and memory increase exponentially with the
+        number of partitioned tables being joined and they increase linearly
+        with the number of partitions. The default is <literal>off</>.

I think this is too scary and too much technical detail. I think you
could just say something like: Enables or disables use of
partition-wise join, which allows a join between partitioned tables to
be performed by joining the matching partitions. Partition-wise join
currently applies only when the join conditions include all the
columns of the partition keys, which must be of the same data type and
have exactly matching sets of child partitions. Because
partition-wise join planning can use significantly increase CPU time
and memory usage during planning, the default is <literal>off</>.

+partitioned table. The join partners can not be found in other partitions. This
+condition allows the join between partitioned tables to be broken into joins
+between the matching partitions. The resultant join is partitioned in the same

"The join partners can not be found in other partitions." is redundant
with the previous sentence. I suggest deleting it. I also suggest
"This condition allows the join between partitioned tables to be
broken" -> "Because of this, the join between partitioned tables can
be broken".

+relation" for both partitioned table as well as join between partitioned tables
+which can use partition-wise join technique.

for either a partitioned table or a join between compatibly partitioned tables

+Partitioning properties of a partitioned relation are stored in
+PartitionSchemeData structure. Planner maintains a list of canonical partition
+schemes (distinct PartitionSchemeData objects) so that any two partitioned
+relations with same partitioning scheme share the same PartitionSchemeData
+object. This reduces memory consumed by PartitionSchemeData objects and makes
+it easy to compare the partition schemes of joining relations.

Not all of the partitioning properties are stored in the
PartitionSchemeData structure any more. I think this needs some
rethinking and maybe some expansion. As written, each of the first
two sentences needs a "the" at the beginning.

+                               /*
+                                * Create "append" paths for
partitioned joins. Do this before
+                                * creating GatherPaths so that
partial "append" paths in
+                                * partitioned joins will be considered.
+                                */

I think you could shorten this to a single-line comment and just keep
the first sentence. Similarly in the other location where you have
the same sort of thing.

+ * child-joins. Otherwise, add_path might delete a path that some "append"
+ * path has reference to.

to which some path generated here has a reference.

Here and elsewhere, you use "append" rather than Append to refer to
the paths added. I suppose that's weasel-wording to work around the
fact that they might be either Append or MergeAppend paths, but I'm
not sure it's really going to convey that to anyone. I suggest
rephrasing those comments more generically, e.g.:

+ /* Add "append" paths containing paths from child-joins. */

You could say: Build additional paths for this rel from child-join paths.

Or something.

+       if (!REL_HAS_ALL_PART_PROPS(rel))
+               return;

Isn't this an unnecessarily expensive test? I mean, it shouldn't be
possible for it to have some arbitrary subset.

+       /*
+        * Every pair of joining relations we see here should have an equi-join
+        * between partition keys if this join has been deemed as a partitioned
+        * join. See build_joinrel_partition_info() for reasons.
+        */
+       Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype,
+
parent_restrictlist));

I suggest removing this assertion. Seems like overkill to me.

+               child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
+
                    child_rel1->relids,
+
                    child_rel2->relids);

It seems like we might end up doing this multiple times for the same
child join, if there are more than 2 tables involved. Not sure if
there's a good way to avoid that. Similarly for child_restrictlist.

+ pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks);

Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack
allocation is a lot faster, and then you don't need to pfree it.

+ /* Remove the relabel decoration. */

the -> any, decoration -> decorations

+       /*
+        * Replace the Var nodes of parent with those of children in
expressions.
+        * This function may be called within a temporary context, but the
+        * expressions will be shallow-copied into the plan. Hence copy those in
+        * the planner's context.
+        */

I can't make heads or tails of this comment.

--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -23,7 +23,9 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
+#include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
+#include "optimizer/tlist.h"
 #include "optimizer/var.h"
 #include "parser/parsetree.h"
 #include "utils/lsyscache.h"

Maybe not needed? This is the only hunk in this file? Or should this
be part of one of the later patches?

+       Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel));
+
+       /* Ensure child relation is really what it claims to be. */
+       Assert(IS_OTHER_REL(childrel));

I suggest tightening this up a bit by removing the comment and the
blank line that precedes it.

+       foreach(lc, parentrel->reltarget->exprs)
+       {
+               PlaceHolderVar *phv = lfirst(lc);
+
+               if (IsA(phv, PlaceHolderVar))
+               {
+                       /*
+                        * In case the placeholder Var refers to any
of the parent
+                        * relations, translate it to refer to the
corresponding child.
+                        */
+                       if (bms_overlap(phv->phrels, parentrel->relids) &&
+                               childrel->reloptkind == RELOPT_OTHER_JOINREL)
+                       {
+                               phv = (PlaceHolderVar *)
adjust_appendrel_attrs(root,
+
                                                         (Node *) phv,
+
                                                         nappinfos,
+
                                                         appinfos);
+                       }
+
+                       childrel->reltarget->exprs =
lappend(childrel->reltarget->exprs,
+
                          phv);
+                       phv_added = true;
+               }
+       }

What if the PHV is buried down inside the expression someplace rather
than being at the top level? More generally, why are we not just
applying adjust_appendrel_attrs() to the whole expression?

+       /* Adjust the cost and width of child targetlist. */
+       if (phv_added)
+       {
+               childrel->reltarget->cost.startup =
parentrel->reltarget->cost.startup;
+               childrel->reltarget->cost.per_tuple =
parentrel->reltarget->cost.per_tuple;
+               childrel->reltarget->width = parentrel->reltarget->width;
+       }

Making this conditional on phv_added is probably not saving anything.
Branches are expensive.

                /*
                 * Otherwise, anything in a baserel or joinrel
targetlist ought to be
-                * a Var.  (More general cases can only appear in
appendrel child
-                * rels, which will never be seen here.)
+                * a Var or ConvertRowtypeExpr. For either of those,
find the original
+                * baserel where they originate.
                 */

Hmm, but now we could potentially see an appendrel child rel here, so
don't we need to worry about more general cases? If not, let's
explain why not.

+ * if, it's a ConvertRowtypeExpr, it will be
computed only for the

American usage does not put a comma after if like this (unless you are
writing writing if, for example, blah blah blah -- but there the
commas are to surround for example, not due to the if itself).

+/*
+ * build_joinrel_partition_info
+ *             If the join between given partitioned relations is
possibly partitioned
+ *             set the partitioning scheme and partition keys
expressions for the
+ *             join.
+ *
+ * If the two relations have same partitioning scheme, their join may be
+ * partitioned and will follow the same partitioning scheme as the joining
+ * relations.
+ */

I think you could drop the primary comment block and use the secondary
block as the primary one. That is, get rid of "If the join
between..." and promote "If the two relations...".

+ * The join is not partitioned, if any of the relations being joined are

Another comma that's not typical of American usage.

+ * For an N-way inner join, where every syntactic inner join
has equi-join

has -> has an

+        * For an N-way join with outer joins, where every syntactic join has an
+        * equi-join between partition keys and a matching partitioning scheme,
+        * outer join reordering identities in optimizer/README imply that only
+        * those pairs of join are legal which have an equi-join
between partition
+        * keys. Thus every pair of joining relations we see for this
join should
+        * have an equi-join between partition keys if this join has been deemed
+        * as a partitioned join.

In line 2, partition keys -> the partition keys
In line 3, outer join -> the outer join

"pairs of join" sounds wrong too, although I'm not sure how to reword it.

More broadly: I don't think I understand this comment. The statement
about "those pairs of join are legal which have an equi-join between
partition keys" doesn't match my understanding e.g. A IJ B ON A.x =
B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such
clause syntatically.

Maybe you could replace this whole comment block with something like
this: We can only consider this join as an input to further
partition-wise joins if (a) the input relations are partitioned, (b)
the partition schemes match, and (c) we can identify an equi-join
between the partition keys. Note that if it were possible for
have_partkey_equi_join to return different answers for the same
joinrel depending on which join ordering we try first, this logic
would break. That shouldn't happen, though, because of the way the
query planner deduces implied equalities.

+        * Join relation is partitioned using same partitioning scheme as the
+        * joining relations and has same bounds.

the same partitioning scheme

+        * An INNER join between two partitioned relations is partitioned by key
+        * expressions from both the relations. For tables A and B
partitioned by
+        * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by
+        * both A.a and B.b.
+        *
+        * A SEMI/ANTI join only retains data from the outer side and is
+        * partitioned by the partition keys of the outer side.

I would write: An INNER join between two partitioned relations can be
regarded as partitioned by either key expression. For example, A
INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on
B.b; they are equivalent. For a SEMI or ANTI join, the result can
only be regarded as being partitioned in the same manner as the outer
side, since the inner columns are not retained.

+        * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+        * B.b NULL. These rows may not fit the partitioning
conditions imposed on
+        * B.b. Hence, strictly speaking, the join is not partitioned by B.b.

Good.

+        * Strictly speaking, partition keys of an OUTER join should include
+        * partition key expressions from the OUTER side only. Consider a join

I would join this with the previous sentence instead of repeating
strictly speaking: ...and thus the partition keys should include
partition key expressions from the OUTER side only. After that
sentence, I'd skip a lot of the intermediate words here and continue
this way: However, because all commonly-used comparison operators are
strict, the presence of nulls on the outer side doesn't cause any
problem; they can't match anything at future join levels anyway.
Therefore, we track two sets of expressions: those that authentically
partition the relation (partexprs) and those that partition the
relation with the exception that extra nulls may be present
(nullable_partexprs). When the comparison operator is strict, the
latter is just as good as the former.

Then, I think you can omit the rest of what you have; it should be
clear enough what's going on for the full and right cases given that
explanation.

+ * being joined. partexprs and nullable_partexprs are arrays
containing part_scheme->partnatts

Long line, needs reflowing.

I don't think this is too far from being committable. You've done
some nice work here!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#276Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#275)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Oct 4, 2017 at 11:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

+        Enables or disables the query planner's use of partition-wise join
+        plans. When enabled, it spends time in creating paths for joins between
+        partitions and consumes memory to construct expression nodes to be used
+        for those joins, even if partition-wise join does not result in the
+        cheapest path. The time and memory increase exponentially with the
+        number of partitioned tables being joined and they increase linearly
+        with the number of partitions. The default is <literal>off</>.

I think this is too scary and too much technical detail. I think you
could just say something like: Enables or disables use of
partition-wise join, which allows a join between partitioned tables to
be performed by joining the matching partitions. Partition-wise join
currently applies only when the join conditions include all the
columns of the partition keys, which must be of the same data type and
have exactly matching sets of child partitions. Because
partition-wise join planning can use significantly increase CPU time
and memory usage during planning, the default is <literal>off</>.

Not enough caffeine, obviously: should have been something like --
Because partition-wise join can significantly increase the CPU and
memory costs of planning...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#277Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#273)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Oct 4, 2017 at 8:23 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I am not sure whether your assumption that expression with no Vars
would have em_relids empty is correct. I wonder whether we will add
any em_is_child members with empty em_relids; looking at
process_equivalence() those come from RestrictInfo::left/right_relids
which just indicates the relids at which that particular expression
can be evaluated. Place holder vars is an example when that can
happen, but there may be others. To verify this, I tried attached
patch on master and ran make check. The assertion didn't trip. If
em_relids is not NULL, bms_is_subset() is fine.

I spent some more time experimenting with this. I found that cases
where an em_is_child equivalence class contains multiple relids are
quite easy to generate, e.g. select * from foo, bar where foo.a +
bar.a = 0, where foo and bar are partitioned. However, I wasn't able
to generate a case where an em_is_child equivalence class has no
relids at all, and I'm out of ideas about how such a thing could
occur. I suspect it can't. I wondered whether there was some problem
with the multiple-relids case, but I can't find an example where that
misbehaves either. So maybe it's fine (or maybe I'm just not smart
enough to find the case where it breaks).

I don't think I believe that comment, either. In the case from which
that comment was copied (mark_dummy_rel), it was talking about a
RelOptInfo, and geqo_eval() takes care to remove any leftover pointers
to joinrels creating during a GEQO cycle. But there's no similar
logic for ppilist, so I think what will happen here is that you'll end
up with a freed node in the middle of the list.

In mark_dummy_rel() it's not about RelOptInfo, it's about the pathlist
with dummy path being created in the same context as the RelOptInfo.
Same applies here.

Oops. I was thinking that the ppilist was attached to some
planner-global structure, but it's not; it's hanging off the
RelOptInfo. So you're entirely right, and I'm just being dumb.

We need to reparameterize any path which contains further paths and/or
contains expressions that point to the parent relation. For a given
path we need to reparameterize any paths that it contains and
translate any expressions that are specific to that path. Expressions
common across the paths are translated after the switch case. I have
added this rule to the comment just above the switch case
/*
* Copy of the given path. Reparameterize any paths referenced by the given
* path. Replace parent Vars in path specific expressions by corresponding
* child Vars.
*/
Does that look fine or we want to add explanation for every node handled here.

No, I don't think we want something for every node, just a general
explanation at the top of the function. Maybe something like this:

Most fields from the original path can simply be flat-copied, but any
expressions must be adjusted to refer to the correct varnos, and any
paths must be recursively reparameterized. Other fields that refer to
specific relids also need adjustment.

I don't see much point in the T_SubqueryScanPath and T_ResultPath
cases in reparameterize_path_by_child(). It's just falling through to
the default case.

I added those cases separately to explain why we should not see those
cases in that switch case. I think that explanation is important
(esp. considering your comment above) and associating those comment
with "case" statement looks better. Are you suggesting that we should
add that explanation in default case?

Or leave the explanation out altogether.

I wonder if reparameterize_path_by_child() ought to default to
returning NULL rather than throwing an error; the caller would then
have to be prepared for that and skip building the path. But that
would be more like what reparameterize_path() does, and it would make
failure to include some relevant path type here a corner-case
performance bug rather than a correctness issue. It seems like
someone adding a new path type could quite easily fail to realize that
it might need to be added here, or might be unsure whether it's
necessary to add it here.

I am OK with that. However reparameterize_path_by_child() and
reparameterize_paths_by_child() are callers of
reparameterize_path_by_child() so they will need to deal with NULL
return. I am fine with that too, but making sure that we are on the
same page. If we do that, we could simply assert that the switch case
doesn't see T_SubqueryScanPath and T_ResultPath.

Or do nothing at all about those cases.

I noticed today that the version of the patchset I have here says in
the header comments for reparameterize_path_by_child() that it returns
NULL if it can't reparameterize, but that's not what it actually does.
If you make this change, the existing comment will become correct.

The problem with the NULL return convention is that it's not very
convenient when this function is recursing. Maybe we should change
this function's signature to be bool
reparameterize_path_by_child(PlannerInfo *root, RelOptInfo *child_rel,
Path **path); then you could do, e.g. if
(!reparameterize_path_by_child(root, child_rel, &bhpath->bitmapqual))
return;

But I don't really like that approach; it's still quite long-winded.
Instead, I suggest Stupid Macro Tricks:

#define ADJUST_CHILD_ATTRS(val) \
val = (List *) adjust_appendrel_attrs_multilevel((Node *) val,
child_rel->relids, child_rel->top_parent_relids);

#define REPARAMETERIZE_CHILD_PATH(val) \
val = reparameterize_path_by_child(root, val, child_rel); \
if (val == NULL) \
return NULL;

#define REPARAMETERIZE_CHILD_PATH_LIST(val) \
if (val != NIL) \
{ \
val = reparameterize_pathlist_by_child(root, val, child_rel); \
if (val == NIL) \
return NULL; \
}

With that, a complicated case like T_NestPath becomes just:

JoinPath *jpath;

FLAT_COPY_PATH(jpath, path, NestPath);
REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
new_path = (Path *) jpath;

Now, I admit that hiding stuff inside the macro definitions like that
is ugly. But I think it's still better than repeating boilerplate
code with finnicky internal bits lots of times.

Yes, I too am thinking about the same. The only reason I have EXPLAIN
output there is to check whether partition-wise join is being used or
not. The testcase is not interested in the actual shape. It doesn't
make sense to just test the output if partition-wise join is not used.
May be a function examining the plan tree would help. The function
will have to handle Result/Sort nodes on top and make sure that Append
has join children. Do you have any other idea to check the shape of
the plan tree without the details? Any EXPLAIN switch, existing
functions etc.?

No, not really. We may just need to be prepared to fix whatever breaks.

I think it would be good to have a test case that shows multi-level
partition-wise join working across multiple levels. I wrote the
attached test, which you're welcome to use if you like it, adapt if
you sorta like it, or replace if you dislike it. The table names at
least should be changed to something less likely to duplicate other
tests.

There are tests for multi-level partitioned table in the file. They
test whole partition hierarchy join, part of it being joined based on
the quals. Search for
--
-- multi-leveled partitions
--

Have you looked at those? They test two-level partitioned tables and
your test tests three-level partitioned table. I can modify the tests
to have three levels of partitions and different partition schemes on
different levels. Is that what you expect?

Oops, no, I just missed the test case. I saw the one that said "inner
join, qual covering only top-level partitions" and missed that there
were others later where the quals covered lower levels also.

Instead of "multi-leveled partitions" it might read better to say
"multiple levels of partitioning".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#278Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#274)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Oct 4, 2017 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 21, 2017 at 8:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 21, 2017 at 8:21 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

About your earlier comment of making build_joinrel_partition_info()
simpler. Right now, the code assumes that partexprs or
nullable_partexpr can be NULL when either of them is not populated.
That may be saves a sizeof(pointer) * (number of keys) byes of memory.
Saving that much memory may not be worth the complexity of code. So,
we may always allocate memory for those arrays and fill it with NIL
values when there are no key expressions to populate those. That will
simplify the code. I haven't done that change in this patchset. I was
busy debugging the Q7 regression. Let me know your comments about
that.

Hmm, I'm not sure that's the best approach, but let me look at it more
carefully before I express a firm opinion.

Having studied this a bit more, I now think your proposed approach is
a good idea.

Thanks. Done.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#279Alvaro Herrera
Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#269)
Re: Partition-wise join for join between (declaratively) partitioned tables

Robert Haas wrote:

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

If I understand correctly, what's being used here is the "-wise" suffix,
unrelated to wisdom, which Merriam Webster lists as "adverb combining
form" here https://www.merriam-webster.com/dictionary/wise (though you
have to scroll down a lot), which is defined as

1 a :in the manner of * crabwise * fanwise
b :in the position or direction of * slantwise * clockwise
2 :with regard to :in respect of * dollarwise

According to that, the right way to write this is "partitionwise join"
(no dash), which means "join in respect of partitions", "join with
regard to partitions".

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#280Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#279)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Oct 5, 2017 at 9:48 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

If I understand correctly, what's being used here is the "-wise" suffix,
unrelated to wisdom, which Merriam Webster lists as "adverb combining
form" here https://www.merriam-webster.com/dictionary/wise (though you
have to scroll down a lot), which is defined as

1 a :in the manner of * crabwise * fanwise
b :in the position or direction of * slantwise * clockwise
2 :with regard to :in respect of * dollarwise

According to that, the right way to write this is "partitionwise join"
(no dash), which means "join in respect of partitions", "join with
regard to partitions".

I'm fine with that, if others like it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#281Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Alvaro Herrera (#279)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Oct 5, 2017 at 7:18 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

Regarding nomenclature and my previous griping about wisdom, I was
wondering about just calling this a "partition join" like you have in
the regression test. So the GUC would be enable_partition_join, you'd
have generate_partition_join_paths(), etc. Basically just delete
"wise" throughout.

If I understand correctly, what's being used here is the "-wise" suffix,
unrelated to wisdom, which Merriam Webster lists as "adverb combining
form" here https://www.merriam-webster.com/dictionary/wise (though you
have to scroll down a lot), which is defined as

1 a :in the manner of * crabwise * fanwise
b :in the position or direction of * slantwise * clockwise
2 :with regard to :in respect of * dollarwise

That's right.

According to that, the right way to write this is "partitionwise join"
(no dash), which means "join in respect of partitions", "join with
regard to partitions".

Google lists mostly "partition wise" or "partition-wise" and very
rarely "partitionwise". The first being used in other DBMS literature.
The paper (there aren't many on this subject) I referred [1]https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf uses
"partition-wise". It made more sense to replace " " or "-" with "_"
when syntax doesn't allow the first two. I am not against
"partitionwise" but I don't see any real reason why we should move
away from popular usage of this term.

[1]: https://users.cs.duke.edu/~shivnath/papers/sigmod295-herodotou.pdf

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#282Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#275)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Oct 4, 2017 at 9:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 3, 2017 at 3:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I decided to skip over 0001 for today and spend some time looking at
0002-0006.

Back to 0001.

+        Enables or disables the query planner's use of partition-wise join
+        plans. When enabled, it spends time in creating paths for joins between
+        partitions and consumes memory to construct expression nodes to be used
+        for those joins, even if partition-wise join does not result in the
+        cheapest path. The time and memory increase exponentially with the
+        number of partitioned tables being joined and they increase linearly
+        with the number of partitions. The default is <literal>off</>.

I think this is too scary and too much technical detail. I think you
could just say something like: Enables or disables use of
partition-wise join, which allows a join between partitioned tables to
be performed by joining the matching partitions. Partition-wise join
currently applies only when the join conditions include all the
columns of the partition keys, which must be of the same data type and
have exactly matching sets of child partitions. Because
partition-wise join planning can use significantly increase CPU time
and memory usage during planning, the default is <literal>off</>.

Done. With slight change. "include all the columns of the partition
keys" has a different meaning when partition key is an expression, so
I have used "include all the partition keys". Also changed the last
sentence as "... can use significantly more CPU time and memory during
planning ...". Please feel free to revert those changes, if you don't
like them.

+partitioned table. The join partners can not be found in other partitions. This
+condition allows the join between partitioned tables to be broken into joins
+between the matching partitions. The resultant join is partitioned in the same

"The join partners can not be found in other partitions." is redundant
with the previous sentence. I suggest deleting it. I also suggest
"This condition allows the join between partitioned tables to be
broken" -> "Because of this, the join between partitioned tables can
be broken".

Done.

+relation" for both partitioned table as well as join between partitioned tables
+which can use partition-wise join technique.

for either a partitioned table or a join between compatibly partitioned tables

Done.

+Partitioning properties of a partitioned relation are stored in
+PartitionSchemeData structure. Planner maintains a list of canonical partition
+schemes (distinct PartitionSchemeData objects) so that any two partitioned
+relations with same partitioning scheme share the same PartitionSchemeData
+object. This reduces memory consumed by PartitionSchemeData objects and makes
+it easy to compare the partition schemes of joining relations.

Not all of the partitioning properties are stored in the
PartitionSchemeData structure any more. I think this needs some
rethinking and maybe some expansion. As written, each of the first
two sentences needs a "the" at the beginning.

Changed to

The partitioning properties of a partitioned relation are stored in its
RelOptInfo. The information about data types of partition keys are stored in
PartitionSchemeData structure. The planner maintains a list of canonical
partition schemes (distinct PartitionSchemeData objects) so that RelOptInfo of
any two partitioned relations with same partitioning scheme point to the same
PartitionSchemeData object. This reduces memory consumed by
PartitionSchemeData objects and makes it easy to compare the partition schemes
of joining relations.

Let me know if this looks good.

+                               /*
+                                * Create "append" paths for
partitioned joins. Do this before
+                                * creating GatherPaths so that
partial "append" paths in
+                                * partitioned joins will be considered.
+                                */

I think you could shorten this to a single-line comment and just keep
the first sentence. Similarly in the other location where you have
the same sort of thing.

Done.

+ * child-joins. Otherwise, add_path might delete a path that some "append"
+ * path has reference to.

to which some path generated here has a reference.

Done.

Here and elsewhere, you use "append" rather than Append to refer to
the paths added. I suppose that's weasel-wording to work around the
fact that they might be either Append or MergeAppend paths, but I'm
not sure it's really going to convey that to anyone. I suggest
rephrasing those comments more generically, e.g.:

+ /* Add "append" paths containing paths from child-joins. */

You could say: Build additional paths for this rel from child-join paths.

Or something.

Done. Removed word "append" from the comments in merge_clump(),
standard_join_search() and prologue of
generate_partition_wise_join_paths(). Changed the last comment as per
your suggestion.

+       if (!REL_HAS_ALL_PART_PROPS(rel))
+               return;
Isn't this an unnecessarily expensive test?  I mean, it shouldn't be
possible for it to have some arbitrary subset.

All this function cares about is whether the given relation has any
partitions which can be simply checked by rel->nparts > 0 and
rel->part_rels != NULL. We need to explicitly check part_rels because
an outer join which has empty inner side in every pair will have
part_scheme, partbounds, nparts all set, but not part_rels. See
relevant comments in try_partition_wise_join() for more details. I
have now replaced macro with checks on rel->nparts and rel->part_rels.
This would change with the last patch dealing with partition-wise join
involving dummy relations. Once we have that an outer join like above
will also have part_rels set. But even then I think checking for
part_rels and nparts makes more sense than part_scheme and partbounds.

+       /*
+        * Every pair of joining relations we see here should have an equi-join
+        * between partition keys if this join has been deemed as a partitioned
+        * join. See build_joinrel_partition_info() for reasons.
+        */
+       Assert(have_partkey_equi_join(rel1, rel2, parent_sjinfo->jointype,
+
parent_restrictlist));

I suggest removing this assertion. Seems like overkill to me.

I thought it was good to have there to catch any bug breaking that
rule. But I have removed it as per your suggestion.
Do you think we should remove following assertions as well?
/*
* Since we allow partition-wise join only when the partition bounds of
* the joining relations exactly match, the partition bounds of the join
* should match those of the joining relations.
*/
Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
joinrel->part_scheme->parttyplen,
joinrel->part_scheme->parttypbyval,
joinrel->boundinfo, rel1->boundinfo));
Assert(partition_bounds_equal(joinrel->part_scheme->partnatts,
joinrel->part_scheme->parttyplen,
joinrel->part_scheme->parttypbyval,
joinrel->boundinfo, rel2->boundinfo));

+               child_sjinfo = build_child_join_sjinfo(root, parent_sjinfo,
+
child_rel1->relids,
+
child_rel2->relids);

It seems like we might end up doing this multiple times for the same
child join, if there are more than 2 tables involved. Not sure if
there's a good way to avoid that.

IIUC every pair of joining relations will use a different sjinfo, A
LEFT JOIN B LEFT JOIN C will have two sjinfos one for AB and other for
BC. For ABC we will use the one for AB to join A with BC and we will
use one for BC to join AB with C. I agree that we are building sjinfo
for AB twice once for joining AB and then for A(BC). In order to avoid
that we will have to somehow link the parent sjinfo with child sjinfo
and avoid translating parent sjinfo again and again. May be the parent
sjinfo can contain a cache of child sjinfos.Do we want to do that in
this patch set? We could avoid translation entirely, if we could use
parent sjinfo for joining children. But that's a pretty deep surgery.

Similarly for child_restrictlist.

Similary for restrictlist. Every joining pair has a different
restrictlist. Otherwise, we would have saved restrictlist in the
joinrel itself.

+ pk_has_clause = (bool *) palloc0(sizeof(bool) * num_pks);

Just do bool pk_has_clause[PARTITION_MAX_KEYS] instead. Stack
allocation is a lot faster, and then you don't need to pfree it.

That's a good idea. Done.

+ /* Remove the relabel decoration. */

the -> any, decoration -> decorations

Done.

+       /*
+        * Replace the Var nodes of parent with those of children in
expressions.
+        * This function may be called within a temporary context, but the
+        * expressions will be shallow-copied into the plan. Hence copy those in
+        * the planner's context.
+        */

I can't make heads or tails of this comment.

haha! My bad. the second sentence is something left of the code where
the child-joins used to be planned in a temporary memory context.
That's not true any more. Removed the entire comment.

--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -23,7 +23,9 @@
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"
#include "optimizer/planmain.h"
+#include "optimizer/prep.h"
#include "optimizer/restrictinfo.h"
+#include "optimizer/tlist.h"
#include "optimizer/var.h"
#include "parser/parsetree.h"
#include "utils/lsyscache.h"

Maybe not needed? This is the only hunk in this file? Or should this
be part of one of the later patches?

I think 0005. Sorry. I will move it there.

+       Assert(IS_JOIN_REL(childrel) && IS_JOIN_REL(parentrel));
+
+       /* Ensure child relation is really what it claims to be. */
+       Assert(IS_OTHER_REL(childrel));

I suggest tightening this up a bit by removing the comment and the
blank line that precedes it.

Done.

+       foreach(lc, parentrel->reltarget->exprs)
+       {
+               PlaceHolderVar *phv = lfirst(lc);
+
+               if (IsA(phv, PlaceHolderVar))
+               {
+                       /*
+                        * In case the placeholder Var refers to any
of the parent
+                        * relations, translate it to refer to the
corresponding child.
+                        */
+                       if (bms_overlap(phv->phrels, parentrel->relids) &&
+                               childrel->reloptkind == RELOPT_OTHER_JOINREL)
+                       {
+                               phv = (PlaceHolderVar *)
adjust_appendrel_attrs(root,
+
(Node *) phv,
+
nappinfos,
+
appinfos);
+                       }
+
+                       childrel->reltarget->exprs =
lappend(childrel->reltarget->exprs,
+
phv);
+                       phv_added = true;
+               }
+       }

What if the PHV is buried down inside the expression someplace rather
than being at the top level?

That can't happen. See add_placeholders_to_joinrel(), which adds these
placeholders to joinrel's target. That function adds PHVs as bare
nodes, not embedded into something else.

More generally, why are we not just
applying adjust_appendrel_attrs() to the whole expression?

Usually targetlists of join have Var nodes which bubble up from the
base relations. Even PHVs bubble up from the lowest join where they
can be evaluated. If we translate reltarget, we will allocate new Var
nodes for every join relation consuming more memory and then setrefs
will need to compare the contents of those nodes instead of just
pointer comparison. We use this code and attr_needed to avoid memory
consumption and setref's CPU consumption.

+       /* Adjust the cost and width of child targetlist. */
+       if (phv_added)
+       {
+               childrel->reltarget->cost.startup =
parentrel->reltarget->cost.startup;
+               childrel->reltarget->cost.per_tuple =
parentrel->reltarget->cost.per_tuple;
+               childrel->reltarget->width = parentrel->reltarget->width;
+       }

Making this conditional on phv_added is probably not saving anything.
Branches are expensive.

Ok.

If there are not PHVs in the query i.e. when root->placeholders_list
is NIL, we don't need to scan reltarget->exprs. I have added that
optimization.

/*
* Otherwise, anything in a baserel or joinrel
targetlist ought to be
-                * a Var.  (More general cases can only appear in
appendrel child
-                * rels, which will never be seen here.)
+                * a Var or ConvertRowtypeExpr. For either of those,
find the original
+                * baserel where they originate.
*/

Hmm, but now we could potentially see an appendrel child rel here, so
don't we need to worry about more general cases? If not, let's
explain why not.

By more general cases, that comment means ConvertRowtypeExpr or
RowExpr, nothing else. A base relation's tlist can have only Var nodes
when it reaches this comment. When a parent Var node is subjected to
adjust_appendrel_attrs() it is translated to a Var node for all
varattnos except 0, which indicates a whole-row var. For a child
table, a whole-row var is always a named row type and hence gets
translated to a ConvertRowExpr. Other kinds of children (subqueries in
union etc.) can not appear here since they do not participate in a
join directly. So it's really a Var and ConvertRowtypeExpr. I have
modified the comment to explain this.

+ * if, it's a ConvertRowtypeExpr, it will be
computed only for the

American usage does not put a comma after if like this (unless you are
writing writing if, for example, blah blah blah -- but there the
commas are to surround for example, not due to the if itself).

That comma was unintentional. Removed.

+/*
+ * build_joinrel_partition_info
+ *             If the join between given partitioned relations is
possibly partitioned
+ *             set the partitioning scheme and partition keys
expressions for the
+ *             join.
+ *
+ * If the two relations have same partitioning scheme, their join may be
+ * partitioned and will follow the same partitioning scheme as the joining
+ * relations.
+ */

I think you could drop the primary comment block and use the secondary
block as the primary one. That is, get rid of "If the join
between..." and promote "If the two relations...".

Done.

+ * The join is not partitioned, if any of the relations being joined are

Another comma that's not typical of American usage.

Done.

+ * For an N-way inner join, where every syntactic inner join
has equi-join

has -> has an

+        * For an N-way join with outer joins, where every syntactic join has an
+        * equi-join between partition keys and a matching partitioning scheme,
+        * outer join reordering identities in optimizer/README imply that only
+        * those pairs of join are legal which have an equi-join
between partition
+        * keys. Thus every pair of joining relations we see for this
join should
+        * have an equi-join between partition keys if this join has been deemed
+        * as a partitioned join.

In line 2, partition keys -> the partition keys
In line 3, outer join -> the outer join

"pairs of join" sounds wrong too, although I'm not sure how to reword it.

More broadly: I don't think I understand this comment. The statement
about "those pairs of join are legal which have an equi-join between
partition keys" doesn't match my understanding e.g. A IJ B ON A.x =
B.x LJ C ON A.x = C.x surely allows a B-C join, but there's no such
clause syntatically.

Maybe you could replace this whole comment block with something like
this: We can only consider this join as an input to further
partition-wise joins if (a) the input relations are partitioned, (b)
the partition schemes match, and (c) we can identify an equi-join
between the partition keys. Note that if it were possible for
have_partkey_equi_join to return different answers for the same
joinrel depending on which join ordering we try first, this logic
would break. That shouldn't happen, though, because of the way the
query planner deduces implied equalities.

Hmm. I meant the second para to be read in the context of the first.
Since AB is inner join A.x and B.x are replaceable (I forgot the
correct term, identity?) and thus A.x = C.x implies B.x = C.x thus
allowing join BC. But I think your version of the comment is easy to
understand. But I think it should also refer to the way planner
reorders joins; that's what causes us to worry about every join order
being partitioned. I think we should redirect a reader, who wants to
understand more about implied equalities and join orders, to
optimizer/README. So, I have changed the last sentence to read "That
shouldn't happen, though, because of the way the query planner deduces
implied equalities and reorders joins. See optimizer/README for
details." If you don't like my changes, please feel free to drop
those.

In the code block following this comment, I have used shorter variable
names instead of accurate but long ones. E.g. outer_expr should have
been outer_partexpr and outer_null_expr should have been
outer_nullable_partexpr. Please feel free to change those if you don't
like them or let me know if you have any better ideas and I will
update the patch with those ideas.

+        * Join relation is partitioned using same partitioning scheme as the
+        * joining relations and has same bounds.

the same partitioning scheme

Done.

+        * An INNER join between two partitioned relations is partitioned by key
+        * expressions from both the relations. For tables A and B
partitioned by
+        * a and b respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by
+        * both A.a and B.b.
+        *
+        * A SEMI/ANTI join only retains data from the outer side and is
+        * partitioned by the partition keys of the outer side.

I would write: An INNER join between two partitioned relations can be
regarded as partitioned by either key expression. For example, A
INNER JOIN B ON A.a = B.b can be regarded as partitioned on A.a or on
B.b; they are equivalent. For a SEMI or ANTI join, the result can
only be regarded as being partitioned in the same manner as the outer
side, since the inner columns are not retained.

Done.

+        * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+        * B.b NULL. These rows may not fit the partitioning
conditions imposed on
+        * B.b. Hence, strictly speaking, the join is not partitioned by B.b.

Good.

+        * Strictly speaking, partition keys of an OUTER join should include
+        * partition key expressions from the OUTER side only. Consider a join

I would join this with the previous sentence instead of repeating
strictly speaking: ...and thus the partition keys should include
partition key expressions from the OUTER side only. After that
sentence, I'd skip a lot of the intermediate words here and continue
this way: However, because all commonly-used comparison operators are
strict, the presence of nulls on the outer side doesn't cause any
problem; they can't match anything at future join levels anyway.
Therefore, we track two sets of expressions: those that authentically
partition the relation (partexprs) and those that partition the
relation with the exception that extra nulls may be present
(nullable_partexprs). When the comparison operator is strict, the
latter is just as good as the former.

Then, I think you can omit the rest of what you have; it should be
clear enough what's going on for the full and right cases given that
explanation.

I liked this version. Changed the comments as per your suggestions.

+ * being joined. partexprs and nullable_partexprs are arrays
containing part_scheme->partnatts

Long line, needs reflowing.

Done. Also fixed a grammatical mistake: contains -> contain in the
last line of that paragraph.

I don't think this is too far from being committable. You've done
some nice work here!

Thanks a lot for your detailed reviews and guidance. I will post the
updated patchset with my next reply.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#283Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#277)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Oct 5, 2017 at 12:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:

We need to reparameterize any path which contains further paths and/or
contains expressions that point to the parent relation. For a given
path we need to reparameterize any paths that it contains and
translate any expressions that are specific to that path. Expressions
common across the paths are translated after the switch case. I have
added this rule to the comment just above the switch case
/*
* Copy of the given path. Reparameterize any paths referenced by the given
* path. Replace parent Vars in path specific expressions by corresponding
* child Vars.
*/
Does that look fine or we want to add explanation for every node handled here.

No, I don't think we want something for every node, just a general
explanation at the top of the function. Maybe something like this:

Most fields from the original path can simply be flat-copied, but any
expressions must be adjusted to refer to the correct varnos, and any
paths must be recursively reparameterized. Other fields that refer to
specific relids also need adjustment.

Done.

I don't see much point in the T_SubqueryScanPath and T_ResultPath
cases in reparameterize_path_by_child(). It's just falling through to
the default case.

I added those cases separately to explain why we should not see those
cases in that switch case. I think that explanation is important
(esp. considering your comment above) and associating those comment
with "case" statement looks better. Are you suggesting that we should
add that explanation in default case?

Or leave the explanation out altogether.

Ok. Removed the explanation and the cases.

I wonder if reparameterize_path_by_child() ought to default to
returning NULL rather than throwing an error; the caller would then
have to be prepared for that and skip building the path. But that
would be more like what reparameterize_path() does, and it would make
failure to include some relevant path type here a corner-case
performance bug rather than a correctness issue. It seems like
someone adding a new path type could quite easily fail to realize that
it might need to be added here, or might be unsure whether it's
necessary to add it here.

I am OK with that. However reparameterize_path_by_child() and
reparameterize_paths_by_child() are callers of
reparameterize_path_by_child() so they will need to deal with NULL
return. I am fine with that too, but making sure that we are on the
same page. If we do that, we could simply assert that the switch case
doesn't see T_SubqueryScanPath and T_ResultPath.

Or do nothing at all about those cases.

I noticed today that the version of the patchset I have here says in
the header comments for reparameterize_path_by_child() that it returns
NULL if it can't reparameterize, but that's not what it actually does.
If you make this change, the existing comment will become correct.

The problem with the NULL return convention is that it's not very
convenient when this function is recursing. Maybe we should change
this function's signature to be bool
reparameterize_path_by_child(PlannerInfo *root, RelOptInfo *child_rel,
Path **path); then you could do, e.g. if
(!reparameterize_path_by_child(root, child_rel, &bhpath->bitmapqual))
return;

But I don't really like that approach; it's still quite long-winded.
Instead, I suggest Stupid Macro Tricks:

#define ADJUST_CHILD_ATTRS(val) \
val = (List *) adjust_appendrel_attrs_multilevel((Node *) val,
child_rel->relids, child_rel->top_parent_relids);

It so happens that every node we subject to
adjust_appendrel_attrs_multilevel is List, so this is ok. In case we
need to adjust some other type of node in future, we will pass node
type too. For now, I have used the macro with (List *) hardcoded
there. Do we write the whole macro on the same line even if it
overflows? I see that being done for CONSIDER_PATH_STARTUP_COST
defined in the same file and you also seem to suggest the same. But
macros at other places are indented. For now, I have indented the
macro on multiple lines.

#define REPARAMETERIZE_CHILD_PATH(val) \
val = reparameterize_path_by_child(root, val, child_rel); \
if (val == NULL) \
return NULL;

#define REPARAMETERIZE_CHILD_PATH_LIST(val) \
if (val != NIL) \
{ \
val = reparameterize_pathlist_by_child(root, val, child_rel); \
if (val == NIL) \
return NULL; \
}

I added do { } while (0) around these code blocks like other places.
Please feel free to remove it if you don't think that's not needed.

With that, a complicated case like T_NestPath becomes just:

JoinPath *jpath;

FLAT_COPY_PATH(jpath, path, NestPath);
REPARAMETERIZE_CHILD_PATH(jpath->outerjoinpath);
REPARAMETERIZE_CHILD_PATH(jpath->innerjoinpath);
ADJUST_CHILD_ATTRS(jpath->joinrestrictinfo);
new_path = (Path *) jpath;

Now, I admit that hiding stuff inside the macro definitions like that
is ugly. But I think it's still better than repeating boilerplate
code with finnicky internal bits lots of times.

I too do not like hiding stuff under macros since that make debugging
hard, but with these macros code looks really elegant. Thanks for the
suggestion.

Also fixed some lines overflowing character limit.

Yes, I too am thinking about the same. The only reason I have EXPLAIN
output there is to check whether partition-wise join is being used or
not. The testcase is not interested in the actual shape. It doesn't
make sense to just test the output if partition-wise join is not used.
May be a function examining the plan tree would help. The function
will have to handle Result/Sort nodes on top and make sure that Append
has join children. Do you have any other idea to check the shape of
the plan tree without the details? Any EXPLAIN switch, existing
functions etc.?

No, not really. We may just need to be prepared to fix whatever breaks.

Sure.

Instead of "multi-leveled partitions" it might read better to say
"multiple levels of partitioning".

Done.

Here's updated set of patches, rebased on top of the latest head.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v36.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v36.tar.gz
#284Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#283)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 6, 2017 at 5:09 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's updated set of patches, rebased on top of the latest head.

In this patchset reparameterize_pathlist_by_child() ignores NULL
return from reparameterize_path_by_child(). Fixed that in the attached
patchset.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v37.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v37.tar.gz
#285Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Ashutosh Bapat (#284)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 6, 2017 at 5:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Oct 6, 2017 at 5:09 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Here's updated set of patches, rebased on top of the latest head.

In this patchset reparameterize_pathlist_by_child() ignores NULL
return from reparameterize_path_by_child(). Fixed that in the attached
patchset.

Sorry. I sent a wrong file. Here's the real v37.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_dp_join_patches_v37.tar.gzapplication/x-gzip; name=pg_dp_join_patches_v37.tar.gz
#286Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#285)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Sorry. I sent a wrong file. Here's the real v37.

Committed 0001-0006. I made some assorted comment and formatting
changes and two small substantive changes:

- In try_nestloop_path, bms_free(outerrelids) before returning if we
can't reparameterize.

- Moved the call to try_partition_wise_join inside
populate_joinrel_with_paths, instead of always calling it just after
that function is called.

I think this is very good work and I'm excited about the feature. Now
I'll wait to see whether the buildfarm, or Tom, yell at me for
whatever problems this may still have...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#287Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#286)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 6, 2017 at 8:40 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Sorry. I sent a wrong file. Here's the real v37.

Committed 0001-0006. I made some assorted comment and formatting
changes and two small substantive changes:

- In try_nestloop_path, bms_free(outerrelids) before returning if we
can't reparameterize.

Hmm. I missed that.

- Moved the call to try_partition_wise_join inside
populate_joinrel_with_paths, instead of always calling it just after
that function is called.

This looks good too.

I think this is very good work and I'm excited about the feature.

Thanks a lot Robert for detailed review and guidance. Thanks a lot
Rafia for benchmarking the feature with TPCH and esp. very large scale
database and also for testing and reported some real issues. Thanks
Rajkumar for testing it with an exhaustive testset. Thanks Amit
Langote, Thomas Munro, Dilip Kumar, Antonin Houska, Alvaro Herrera and
Amit Khandekar for their review comments and suggestions. Thanks
Jeevan Chalke, who used the patchset to implement partition-wise
aggregates and provided some insights offlist. Sorry if I have missed
anybody.

As Robert says in the commit message, there's more to do but now that
we have basic feature, improving it incrementally becomes a lot
easier.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#288Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#286)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think this is very good work and I'm excited about the feature. Now
I'll wait to see whether the buildfarm, or Tom, yell at me for
whatever problems this may still have...

Buildfarm animal prion turned red. Before going into that failure,
good news is that the other animals are green. So the plans are
stable.

prion runs the regression with -DRELCACHE_FORCE_RELEASE, which
destroys a relcache entry as soon as its reference count drops down to
0. This destroys everything that's there in corresponding relcache
entry including partition key information and partition descriptor
information. find_partition_scheme() and set_relation_partition_info()
both assume that the relcache information will survive as long as the
relation lock is held. They do not copy the relevant partitioning
information but just copy the pointers. That assumption is wrong.
Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to
zero, the data in partition scheme and partition bounds goes invalid
and various checks to see if partition wise join is possible fail.
That causes partition_join test to fail on prion. But I think, the bug
could cause crash as well.

The fix is to copy the relevant partitioning information from relcache
into PartitionSchemeData and RelOptInfo. Here's a quick patch with
that fix.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pwj_copy_partinfo.patchtext/x-patch; charset=US-ASCII; name=pwj_copy_partinfo.patch
#289Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#288)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Fri, Oct 6, 2017 at 3:07 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Oct 6, 2017 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think this is very good work and I'm excited about the feature. Now
I'll wait to see whether the buildfarm, or Tom, yell at me for
whatever problems this may still have...

Buildfarm animal prion turned red. Before going into that failure,
good news is that the other animals are green. So the plans are
stable.

prion runs the regression with -DRELCACHE_FORCE_RELEASE, which
destroys a relcache entry as soon as its reference count drops down to
0. This destroys everything that's there in corresponding relcache
entry including partition key information and partition descriptor
information. find_partition_scheme() and set_relation_partition_info()
both assume that the relcache information will survive as long as the
relation lock is held. They do not copy the relevant partitioning
information but just copy the pointers. That assumption is wrong.
Because of -DRELCACHE_FORCE_RELEASE, as soon as refcount drops to
zero, the data in partition scheme and partition bounds goes invalid
and various checks to see if partition wise join is possible fail.
That causes partition_join test to fail on prion. But I think, the bug
could cause crash as well.

The fix is to copy the relevant partitioning information from relcache
into PartitionSchemeData and RelOptInfo. Here's a quick patch with
that fix.

Committed. I hope that makes things less red rather than more,
because I'm going to be AFK for a few hours anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#290Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#289)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed. I hope that makes things less red rather than more,
because I'm going to be AFK for a few hours anyway.

Here's the last patch, dealing with the dummy relations, rebased. With
this fix every join order of a partitioned join can be considered
partitioned. (This wasn't the case earlier when dummy relation was
involved.). So, we can allocate the child-join RelOptInfo array in
build_joinrel_partition_info(), instead of waiting for an appropriate
pair to arrive in try_partition_wise_join().
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Support-partition-wise-join-for-dummy-partitioned-re.patchtext/x-patch; charset=US-ASCII; name=0001-Support-partition-wise-join-for-dummy-partitioned-re.patch
#291Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#290)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed. I hope that makes things less red rather than more,
because I'm going to be AFK for a few hours anyway.

Here's the last patch, dealing with the dummy relations, rebased. With
this fix every join order of a partitioned join can be considered
partitioned. (This wasn't the case earlier when dummy relation was
involved.). So, we can allocate the child-join RelOptInfo array in
build_joinrel_partition_info(), instead of waiting for an appropriate
pair to arrive in try_partition_wise_join().

Wouldn't a far more general approach be to allow a partition-wise join
between a partitioned table and an unpartitioned table, considering
the result as partitioned? That seems like it would very often yield
much better query plans than what we have right now, and also make the
need for this particular thing go away.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#292Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#291)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Oct 11, 2017 at 7:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 9, 2017 at 2:05 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Committed. I hope that makes things less red rather than more,
because I'm going to be AFK for a few hours anyway.

Here's the last patch, dealing with the dummy relations, rebased. With
this fix every join order of a partitioned join can be considered
partitioned. (This wasn't the case earlier when dummy relation was
involved.). So, we can allocate the child-join RelOptInfo array in
build_joinrel_partition_info(), instead of waiting for an appropriate
pair to arrive in try_partition_wise_join().

Wouldn't a far more general approach be to allow a partition-wise join
between a partitioned table and an unpartitioned table, considering
the result as partitioned? That seems like it would very often yield
much better query plans than what we have right now, and also make the
need for this particular thing go away.

You are suggesting that a dummy partitioned table be treated as an
un-partitioned table and apply above suggested optimization. A join
between a partitioned and unpartitioned table is partitioned by the
keys of only partitioned table. An unpartitioned table doesn't have
any keys, so this is fine. But a dummy partitioned table does have
keys. Recording them as keys of the join relation helps when it joins
to other relations. Furthermore a join between partitioned and
unpartitioned table doesn't require any equi-join condition on
partition keys of partitioned table but a join between partitioned
tables is considered to be partitioned by keys on both sides only when
there is an equi-join. So, when implementing a partitioned join
between a partitioned and an unpartitioned table, we will have to make
a special case to record partition keys when the unpartitioned side is
actually a dummy partitioned table. That might be awkward.

Because we don't have dummy children relation in all cases, we already
have some awkwardness like allocating part_rels array only when we
encounter a join order which has all the children. This patch removes
that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#293Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#292)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

You are suggesting that a dummy partitioned table be treated as an
un-partitioned table and apply above suggested optimization. A join
between a partitioned and unpartitioned table is partitioned by the
keys of only partitioned table. An unpartitioned table doesn't have
any keys, so this is fine. But a dummy partitioned table does have
keys. Recording them as keys of the join relation helps when it joins
to other relations. Furthermore a join between partitioned and
unpartitioned table doesn't require any equi-join condition on
partition keys of partitioned table but a join between partitioned
tables is considered to be partitioned by keys on both sides only when
there is an equi-join. So, when implementing a partitioned join
between a partitioned and an unpartitioned table, we will have to make
a special case to record partition keys when the unpartitioned side is
actually a dummy partitioned table. That might be awkward.

It seems to me that what we really need here is to move all of this
stuff into a separate struct:

/* used for partitioned relations */
PartitionScheme part_scheme; /* Partitioning scheme. */
int nparts; /* number of
partitions */
struct PartitionBoundInfoData *boundinfo; /* Partition bounds */
struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions,

* stored in the same order of bounds */
List **partexprs; /* Non-nullable partition key
expressions. */
List **nullable_partexprs; /* Nullable partition key
expressions. */

...and then have a RelOptInfo carry a pointer to a list of those
structures. That lets us consider multiple possible partition schemes
for the same relation. For instance, suppose that a user joins four
relations, P1, P2, Q1, and Q2. P1 and P2 are compatibly partitioned.
Q1 and Q2 are compatibly partitioned (but not compatible with P1 and
P2).

Furthermore, let's suppose that the optimal join order begins with a
join between P1 and Q1. When we construct the paths for that joinrel,
we can either join all of P1 to all of Q1 (giving up on partition-wise
join), or we can join each partition of P1 to all of Q1 (producing a
result partitioned compatibly with P1 and allowing for a future
partition-wise join to P2), or we can join each partition of Q1 to all
of P1 (producing a result partitioned compatibly with Q1 and allowing
for a future partition-wise join to Q2). Any of those could win
depending on the details. With the data structure as it is today,
we'd have to choose whether to mark the joinrel as partitioned like P1
or like Q1, but that's not really what we need here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#294Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#293)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Thu, Oct 12, 2017 at 10:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 11, 2017 at 10:43 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

You are suggesting that a dummy partitioned table be treated as an
un-partitioned table and apply above suggested optimization. A join
between a partitioned and unpartitioned table is partitioned by the
keys of only partitioned table. An unpartitioned table doesn't have
any keys, so this is fine. But a dummy partitioned table does have
keys. Recording them as keys of the join relation helps when it joins
to other relations. Furthermore a join between partitioned and
unpartitioned table doesn't require any equi-join condition on
partition keys of partitioned table but a join between partitioned
tables is considered to be partitioned by keys on both sides only when
there is an equi-join. So, when implementing a partitioned join
between a partitioned and an unpartitioned table, we will have to make
a special case to record partition keys when the unpartitioned side is
actually a dummy partitioned table. That might be awkward.

It seems to me that what we really need here is to move all of this
stuff into a separate struct:

/* used for partitioned relations */
PartitionScheme part_scheme; /* Partitioning scheme. */
int nparts; /* number of
partitions */
struct PartitionBoundInfoData *boundinfo; /* Partition bounds */
struct RelOptInfo **part_rels; /* Array of RelOptInfos of partitions,

* stored in the same order of bounds */
List **partexprs; /* Non-nullable partition key
expressions. */
List **nullable_partexprs; /* Nullable partition key
expressions. */

In a very early patch I had PartitionOptInfo to hold all of this.
RelOptInfo then had a pointer of PartitionOptInfo, if it was
partitioned. When a relation can be partitioned in multiple ways like
what you describe or because join by re-partitioning is efficient,
RelOptInfo would have a list of those. But the representation needs to
be thought through. I am wondering whether this should be modelled
like IndexOptInfo. I am not sure. This is a topic of much larger
discussion.

I think we are digressing. We were discussing my patch to handle dummy
partitioned relation, whose children are not marked dummy and do not
have pathlists set. Do you still think that we should leave that
aside?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#295Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#289)
1 attachment(s)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The fix is to copy the relevant partitioning information from relcache
into PartitionSchemeData and RelOptInfo. Here's a quick patch with
that fix.

Committed. I hope that makes things less red rather than more,
because I'm going to be AFK for a few hours anyway.

set_append_rel_size() crashes when it encounters a partitioned table
with a dropped column. Dropped columns do not have any translations
saved in AppendInfo::translated_vars; the corresponding entry is NULL
per make_inh_translation_list().
1802 att = TupleDescAttr(old_tupdesc, old_attno);
1803 if (att->attisdropped)
1804 {
1805 /* Just put NULL into this list entry */
1806 vars = lappend(vars, NULL);
1807 continue;
1808 }

In set_append_rel_size() we try to attr_needed for child tables. While
doing so we try to translate a user attribute number of parent to that
of a child and crash since the translated Var is NULL. Here's patch to
fix the crash. The patch also contains a testcase to test dropped
columns in partitioned table.

Sorry for not noticing this problem earlier.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

0001-Ignore-dropped-columns-in-set_append_rel_size.patchtext/x-patch; charset=US-ASCII; name=0001-Ignore-dropped-columns-in-set_append_rel_size.patch
#296Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Ashutosh Bapat (#295)
Re: Partition-wise join for join between (declaratively) partitioned tables

On Mon, Oct 16, 2017 at 5:03 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

set_append_rel_size() crashes when it encounters a partitioned table
with a dropped column. Dropped columns do not have any translations
saved in AppendInfo::translated_vars; the corresponding entry is NULL
per make_inh_translation_list().
1802 att = TupleDescAttr(old_tupdesc, old_attno);
1803 if (att->attisdropped)
1804 {
1805 /* Just put NULL into this list entry */
1806 vars = lappend(vars, NULL);
1807 continue;
1808 }

In set_append_rel_size() we try to attr_needed for child tables. While
doing so we try to translate a user attribute number of parent to that
of a child and crash since the translated Var is NULL. Here's patch to
fix the crash. The patch also contains a testcase to test dropped
columns in partitioned table.

Sorry for not noticing this problem earlier.

OK, committed. This is a good example of how having good code
coverage doesn't necessarily mean you've found all the bugs. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#297Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Robert Haas (#296)
1 attachment(s)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, committed. This is a good example of how having good code

coverage doesn't necessarily mean you've found all the bugs. :-)

As of now partition_join.sql is not having test cases covering cases
where partition table have default partition, attaching a small test
case patch to cover those.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

partition_wise_join_with_default_partitions.patchtext/x-patch; charset=US-ASCII; name=partition_wise_join_with_default_partitions.patch
#298Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Rajkumar Raghuwanshi (#297)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Fri, Dec 1, 2017 at 1:36 AM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, committed. This is a good example of how having good code

coverage doesn't necessarily mean you've found all the bugs. :-)

As of now partition_join.sql is not having test cases covering cases
where partition table have default partition, attaching a small test
case patch to cover those.

That's not that small, and to me it looks like overkill.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#299Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Robert Haas (#298)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Sat, Dec 2, 2017 at 2:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 1, 2017 at 1:36 AM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Tue, Oct 31, 2017 at 2:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, committed. This is a good example of how having good code

coverage doesn't necessarily mean you've found all the bugs. :-)

As of now partition_join.sql is not having test cases covering cases
where partition table have default partition, attaching a small test
case patch to cover those.

That's not that small, and to me it looks like overkill.

I agree, the patch looks longer than expected. I think, it's important
to have some testcases to test partition-wise join with default
partitions. I think we need at least one test for range default
partitions, one test for list partitioning, one for multi-level
partitioning and one negative testcase with default partition missing
from one side of join.

May be we could reduce the number of SQL commands and queries in the
patch by adding default partition to every table that participates in
partition-wise join (leave the tables participating in negative tests
aside.). But that's going to increase the size of EXPLAIN outputs and
query results. The negative test may simply drop the default partition
from one of the tables.

For every table being tested, the patch adds two ALTER TABLE commands,
one for detaching an existing partition and then attach the same as
default partition. Alternative to that is just add a new default
partition without detaching and existing partition. But then the
default partition needs to populated with some data, which requires 1
INSERT statement at least. That doesn't reduce the size of patch, but
increases the output of query and EXPLAIN plan.

May be in case of multi-level partitioning test, we don't need to add
DEFAULT in every partitioned relation; adding to one of them would be
enough. May be add it to the parent, but that too can be avoided. That
would reduce the size of patch a bit.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#300Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#299)
1 attachment(s)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I agree, the patch looks longer than expected. I think, it's important
to have some testcases to test partition-wise join with default
partitions. I think we need at least one test for range default
partitions, one test for list partitioning, one for multi-level
partitioning and one negative testcase with default partition missing
from one side of join.

May be we could reduce the number of SQL commands and queries in the
patch by adding default partition to every table that participates in
partition-wise join (leave the tables participating in negative tests
aside.). But that's going to increase the size of EXPLAIN outputs and
query results. The negative test may simply drop the default partition
from one of the tables.

For every table being tested, the patch adds two ALTER TABLE commands,
one for detaching an existing partition and then attach the same as
default partition. Alternative to that is just add a new default
partition without detaching and existing partition. But then the
default partition needs to populated with some data, which requires 1
INSERT statement at least. That doesn't reduce the size of patch, but
increases the output of query and EXPLAIN plan.

May be in case of multi-level partitioning test, we don't need to add
DEFAULT in every partitioned relation; adding to one of them would be
enough. May be add it to the parent, but that too can be avoided. That
would reduce the size of patch a bit.

Thanks Ashutosh for suggestions.

I have reduced test cases as suggested. Attaching updated patch.

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

Attachments:

partition_wise_join_with_default_partitions_v1.patchtext/x-patch; charset=US-ASCII; name=partition_wise_join_with_default_partitions_v1.patch
#301Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#300)
1 attachment(s)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Tue, Dec 5, 2017 at 11:04 AM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I agree, the patch looks longer than expected. I think, it's important
to have some testcases to test partition-wise join with default
partitions. I think we need at least one test for range default
partitions, one test for list partitioning, one for multi-level
partitioning and one negative testcase with default partition missing
from one side of join.

May be we could reduce the number of SQL commands and queries in the
patch by adding default partition to every table that participates in
partition-wise join (leave the tables participating in negative tests
aside.). But that's going to increase the size of EXPLAIN outputs and
query results. The negative test may simply drop the default partition
from one of the tables.

For every table being tested, the patch adds two ALTER TABLE commands,
one for detaching an existing partition and then attach the same as
default partition. Alternative to that is just add a new default
partition without detaching and existing partition. But then the
default partition needs to populated with some data, which requires 1
INSERT statement at least. That doesn't reduce the size of patch, but
increases the output of query and EXPLAIN plan.

May be in case of multi-level partitioning test, we don't need to add
DEFAULT in every partitioned relation; adding to one of them would be
enough. May be add it to the parent, but that too can be avoided. That
would reduce the size of patch a bit.

Thanks Ashutosh for suggestions.

I have reduced test cases as suggested. Attaching updated patch.

Sorry Attached wrong patch.

attaching correct patch now.

Attachments:

partition_wise_join_with_default_partitions_v2.patchtext/x-patch; charset=US-ASCII; name=partition_wise_join_with_default_partitions_v2.patch
#302Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#301)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Tue, Dec 5, 2017 at 1:24 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Tue, Dec 5, 2017 at 11:04 AM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I agree, the patch looks longer than expected. I think, it's important
to have some testcases to test partition-wise join with default
partitions. I think we need at least one test for range default
partitions, one test for list partitioning, one for multi-level
partitioning and one negative testcase with default partition missing
from one side of join.

May be we could reduce the number of SQL commands and queries in the
patch by adding default partition to every table that participates in
partition-wise join (leave the tables participating in negative tests
aside.). But that's going to increase the size of EXPLAIN outputs and
query results. The negative test may simply drop the default partition
from one of the tables.

For every table being tested, the patch adds two ALTER TABLE commands,
one for detaching an existing partition and then attach the same as
default partition. Alternative to that is just add a new default
partition without detaching and existing partition. But then the
default partition needs to populated with some data, which requires 1
INSERT statement at least. That doesn't reduce the size of patch, but
increases the output of query and EXPLAIN plan.

May be in case of multi-level partitioning test, we don't need to add
DEFAULT in every partitioned relation; adding to one of them would be
enough. May be add it to the parent, but that too can be avoided. That
would reduce the size of patch a bit.

Thanks Ashutosh for suggestions.

I have reduced test cases as suggested. Attaching updated patch.

Sorry Attached wrong patch.

attaching correct patch now.

Thanks. Here are some comments

+-- test default partition behavior for range
+ALTER TABLE prt1 DETACH PARTITION prt1_p3;
+ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;

I think we need an ANALYZE here in case the statistics gets updated while
DETACH and ATTACH is going on. Other testcases also need to be updated with
ANALYZE, including the negative one.

+-- partition-wise join can not be applied if the only one of joining table have

Correction: ... if only one of the joining tables has ...

Please add the patch to the next commitfest so that it's not
forgotten. I think we can get rid of the multi-level partition-wise
testcase as well. Also, since we are re-attaching existing partition
tables as default partitions, we don't need to check the output as
well; just plan should be enough.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#303Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#302)
1 attachment(s)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

Thanks. Here are some comments

Thanks Ashutosh for review and suggestions.

+-- test default partition behavior for range
+ALTER TABLE prt1 DETACH PARTITION prt1_p3;
+ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;

I think we need an ANALYZE here in case the statistics gets updated while
DETACH and ATTACH is going on. Other testcases also need to be updated with
ANALYZE, including the negative one.

Done.

+-- partition-wise join can not be applied if the only one of joining
table have

Correction: ... if only one of the joining tables has ...

Done.

Please add the patch to the next commitfest so that it's not
forgotten.

Done.
Added to CF: https://commitfest.postgresql.org/16/1426/

I think we can get rid of the multi-level partition-wise
testcase as well. Also, since we are re-attaching existing partition
tables as default partitions, we don't need to check the output as
well; just plan should be enough.

Ok. Done.

updated test patch attached.

Attachments:

partition_wise_join_with_default_partitions_v3.patchtext/x-patch; charset=US-ASCII; name=partition_wise_join_with_default_partitions_v3.patch
#304Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#303)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Fri, Dec 22, 2017 at 3:00 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Thanks. Here are some comments

Thanks Ashutosh for review and suggestions.

+-- test default partition behavior for range
+ALTER TABLE prt1 DETACH PARTITION prt1_p3;
+ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;

I think we need an ANALYZE here in case the statistics gets updated while
DETACH and ATTACH is going on. Other testcases also need to be updated
with
ANALYZE, including the negative one.

Done.

+-- partition-wise join can not be applied if the only one of joining
table have

Correction: ... if only one of the joining tables has ...

Done.

Please add the patch to the next commitfest so that it's not
forgotten.

Done.
Added to CF: https://commitfest.postgresql.org/16/1426/

I think we can get rid of the multi-level partition-wise
testcase as well. Also, since we are re-attaching existing partition
tables as default partitions, we don't need to check the output as
well; just plan should be enough.

Ok. Done.

updated test patch attached.

The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#305Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#304)
1 attachment(s)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Dec 22, 2017 at 3:00 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

updated test patch attached.

Changed partition-wise statement to partitionwise.
Attached re-based patch.

The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.

Thanks Ashutosh.

Attachments:

partitionwise_join_with_default_partitions_v4.patchtext/x-patch; charset=US-ASCII; name=partitionwise_join_with_default_partitions_v4.patch
#306Thomas Munro
Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Rajkumar Raghuwanshi (#305)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat
Changed partition-wise statement to partitionwise.
Attached re-based patch.

The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.

Hi Rajkumar,

partition_join ... FAILED

The regression test currently fails with your v4 patch because a
redundant Result node has been removed from a query plan. That may be
due to commit 11cf92f6 or nearby commits.

--
Thomas Munro
http://www.enterprisedb.com

#307Ashutosh Bapat
Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Thomas Munro (#306)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jun 6, 2018 at 8:11 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat
Changed partition-wise statement to partitionwise.
Attached re-based patch.

The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.

Hi Rajkumar,

partition_join ... FAILED

That made my heart stop for fraction of a second. I thought, something
happened which caused partition_join test fail in master. But then I
realised you are talking about Rajkumar's patch and test in that
patch. I think it's better to start a separate thread discussing his
patch, before I loose my heart ;)

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#308Rajkumar Raghuwanshi
Rajkumar Raghuwanshi
rajkumar.raghuwanshi@enterprisedb.com
In reply to: Ashutosh Bapat (#307)
Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

On Wed, Jun 6, 2018 at 9:21 AM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Jun 6, 2018 at 8:11 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Mon, Mar 5, 2018 at 8:13 PM, Rajkumar Raghuwanshi
<rajkumar.raghuwanshi@enterprisedb.com> wrote:

On Wed, Feb 7, 2018 at 2:00 PM, Ashutosh Bapat
Changed partition-wise statement to partitionwise.
Attached re-based patch.

The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.

Hi Rajkumar,

partition_join ... FAILED

Thanks Thomas for patch review.

That made my heart stop for fraction of a second. I thought, something

happened which caused partition_join test fail in master. But then I
realised you are talking about Rajkumar's patch and test in that
patch. I think it's better to start a separate thread discussing his
patch, before I loose my heart ;)

Yeah, that would be better.

here is the new thread with updated patch.
/messages/by-id/CAKcux6ky5YeZAY74qSh-ayPZZEQchz092g71iXXbC0+E3xoscA@mail.gmail.com

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation