Re: [Proposal] Table partition + join pushdown
On Tue, Jan 19, 2016 at 7:59 AM, Greg Stark <stark@mit.edu> wrote:
On Mon, Jan 18, 2016 at 5:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
For example, suppose that x and y are numeric columns and P(x) is
length(x::text) = 3. Then you could have 1 in one table and 1.0 in
the other table; they join, but P(x) is true for one and false for the
other.
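For instance, the two values compare equal under numeric equality even
though the predicate tells them apart:

    SELECT 1::numeric = 1.0::numeric  AS joins,   -- true
           length(1::numeric::text)   AS len_x,   -- 1, so P(x) is false
           length(1.0::numeric::text) AS len_y;   -- 3, so P(y) is true
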
functions "equality preserving" or something like that. If a function,
or more likely a <function,operator> tuple had this property set then
x op y => f(x) op f(y). This would be most useful for things like
substring or hash functions which would allow partial indexes or
partition exclusion to be more generally useful.Of course then you really want <f,op1,op2> to indicate that "a op1 b
=> f(a) op2 f(b)" so you can handle things like <substring,lt,lte> so
that "a < b => substring(a,n) <= substring(b,n)" and you need some way
to represent the extra arguments to substring and the whole thing
became too complex and got dropped.

But perhaps even a simpler property that only worked for equality and
single-argument functions would be useful, since it would let us mark
hash functions. Or perhaps we only need to mark the few functions that
expose properties that don't affect equality, since I think there are
actually very few of them.
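To make the <substring,lt,lte> case above concrete (assuming an
ordinary deterministic collation), the implication "a < b =>
substring(a,n) <= substring(b,n)" looks like this:

    -- 'apple' < 'apricot', and the two-character prefixes still compare <=
    SELECT 'apple' < 'apricot'                      AS a_lt_b,
           substring('apple' from 1 for 2)
             <= substring('apricot' from 1 for 2)   AS f_a_le_f_b;
    -- both columns are true ('ap' <= 'ap')
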
We could certainly mark operators that amount to testing binary
equality as such, and this optimization could be used for join
operators so marked. But I worry that would become a crutch, with
people implementing optimizations that work for such operators and
leaving numeric (for example) out in the cold. Of course, we could
worry about such problems when and if they happen, and accept the idea
of markings for now. However, I'm inclined to think that there's a
better way to optimize the case Taiki Kondo and Kouhei Kaigai are
targeting.
It seems to me that Greg's idea aims to reduce CPU cycles by replacing
the operator in use. I don't deny that there can be valuable scenarios
for that; however, what we intend with this feature is to reduce the
size of the inner hash table, so that it fits in GPU RAM, for example.
If we get declarative partitioning, an oft-requested feature that has
been worked on by various people over the years and currently by Amit
Langote, and specifically if we get hash partitioning, then we'll
presumably use the hash function for the default operator class of the
partitioning column's datatype to partition the table. Then, if we do
a join against some other table and consider a hash join, we'll be
using the same hash function on our side, and either the same operator
or a compatible operator for some other datatype in the same opfamily
on the other side. At that point, if we push down the join, we can
add a filter on the inner side of the join that the hash value of the
matching column has to map to the partition it's being joined against.
And we don't get a recurrence of this problem in that case, because
we're not dealing with an arbitrary predicate - we're dealing with a
hash function whose equality semantics are defined to be compatible
with the join operator.

That approach works with any data type that has a default hash
operator class, which covers pretty much everything anybody is likely
to care about, including numeric.
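For example, if t is hash-partitioned four ways on x and its partition
t_2 is joined against s, the condition pushed onto the inner side would
look roughly like this sketch (hash_part() is just a hypothetical
stand-in for whatever hash/modulus expression the partitioning scheme
actually uses):

    -- rows of s (the inner, hash-table side) whose hash does not map to
    -- partition 2 of 4 can never match a row of t_2, so they can be
    -- filtered out before the inner hash table is built
    SELECT count(*)
    FROM t_2
    JOIN s ON t_2.x = s.x
    WHERE hash_part(s.x) % 4 = 2;   -- hash_part() is hypothetical
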
Except for the use of CHECK constraints, the above description is
almost the same as what we are intending. Hash-joinable operators are
expected to test equality of both sides at least, so we can predict
which inner columns will have identical values once both tuples are
joined. Then, we can filter out obviously unmatched rows.
At that point, if we push down the join, we can
add a filter on the inner side of the join that the hash value of the
matching column has to map to the partition it's being joined against.
Of course, the implementation is not graceful enough, especially on
the point above, because this extra filter changes the expected number
of rows to be produced by the inner relation, and therefore its cost.
Right now, his patch calls cost_seqscan() and friends by itself,
according to the type of the inner relation. Of course, that is not a
portable approach if the inner relation is not a simple relation scan.
Because of the staging of path construction, an AppendPath with
underlying join paths has to be constructed during the join-path
investigation step. So, what is a reasonable way to build the inner
relation's path node with the filters pushed down?
That is the ugliest part of the current patch.
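Logically, the shape being built is an Append over per-partition joins,
where each arm carries a different filter on the inner side; that is
why the inner relation's row estimate and cost differ from arm to arm.
As a rough sketch (again with a hypothetical hash_part()):

    -- t hash-partitioned into t_0 .. t_3 on x, joined to s on x
    SELECT * FROM t_0 JOIN s ON t_0.x = s.x WHERE hash_part(s.x) % 4 = 0
    UNION ALL
    SELECT * FROM t_1 JOIN s ON t_1.x = s.x WHERE hash_part(s.x) % 4 = 1
    UNION ALL
    SELECT * FROM t_2 JOIN s ON t_2.x = s.x WHERE hash_part(s.x) % 4 = 2
    UNION ALL
    SELECT * FROM t_3 JOIN s ON t_3.x = s.x WHERE hash_part(s.x) % 4 = 3;
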
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Mon, Jan 25, 2016 at 9:09 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Of course, the implementation is not graceful enough, especially on
the point above, because this extra filter changes the expected number
of rows to be produced by the inner relation, and therefore its cost.
Right now, his patch calls cost_seqscan() and friends by itself,
according to the type of the inner relation. Of course, that is not a
portable approach if the inner relation is not a simple relation scan.

Because of the staging of path construction, an AppendPath with
underlying join paths has to be constructed during the join-path
investigation step. So, what is a reasonable way to build the inner
relation's path node with the filters pushed down?
That is the ugliest part of the current patch.
I think that it needs to be done only in contexts where we can
guarantee that the optimization is correct, like declarative hash
partitioning:
/messages/by-id/CA+Tgmob2wfJivFoCDLuUnovJsFTp6QsqxiPpxvNNoGLY+3rjbg@mail.gmail.com
As I said upthread, in general your proposal will not work and will
lead to wrong answers to queries.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company