Improve OR conditions on joined columns.

Started by Jim Nasbyabout 9 years ago40 messageshackers
Jump to latest
#1Jim Nasby
Jim.Nasby@BlueTreble.com

I've a client interested enough in $SUBJECT that they're willing to
offer a bounty on it. An example of the pain is (working example attached):

create temp view denorm as
select f.*, d1.t t1, d2.t t2
from fact f
left join dim d1 on f1=d1.id
left join dim d2 on f2=d2.id
;

-- Fast
explain analyze select count(*) from denorm where 1 in (f1,f2);
explain analyze select count(*) from denorm where '1' in (t1);

-- Slow
explain analyze select count(*) from denorm where '1' in (t1,t2);

They currently work around this by doing a union:

select ... from denorm where t1 = '1'
union
select ... from denorm where t2 = '1'

... or depending on needs using IN instead of =.

AFAICT this can be transformed into a UNION (not all) if dim.id is
unique. Does the upper planner pathification make this any easier?
There's another transform using arrays that's possible as well (see
attached example); I believe that would work regardless of uniqueness.

Just to be clear; the OR by itself is not a problem (as shown by the
first fast query); it's the OR with the JOIN that's a problem.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

Attachments:

test.sqltext/plain; charset=UTF-8; name=test.sql; x-mac-creator=0; x-mac-type=0Download
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#1)
Re: Improve OR conditions on joined columns.

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

AFAICT this can be transformed into a UNION (not all) if dim.id is
unique. Does the upper planner pathification make this any easier?

What I did in 9.6 is a first step. The next step, I think, is to
replace prepunion.c with something that can consider more than one
implementation path for a union.

Although ... actually, that may not be the bottleneck for what you're
after. The issue here is not so much discovering a clever plan for
a union as realizing that the query could be cast as a union in the
first place.

Maybe it'd be better to imagine this as something closer to planagg.c,
that is it knows how to apply a specific high-level optimization that
might or might not be a win, so it builds a path describing that and sees
if it looks cheaper than a path done the normal way. The fact that
we could even build a path describing a union is something that wasn't
there before 9.6, but maybe there's enough infrastructure for that now.

There's another transform using arrays that's possible as well (see
attached example); I believe that would work regardless of uniqueness.

That doesn't look terribly promising for automated application.
And I think it's really dependent on the exact shape of the OR
clause, which is an unpleasant limitation. Considering the amount
of work this will take to do at all, you'd want it to be pretty
general I think. I'm imagining something that would look for an
OR in which each clause referenced only one rel, then if it can
identify a way to re-unique-ify the result, split into a UNION
with an arm for each rel used in the OR. The nature of the
conditions in each OR arm don't seem to matter ... though probably
you'd have to insist on there not being volatile conditions anywhere
in the query, because they'd get evaluated too many times.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#2)
Re: Improve OR conditions on joined columns (common star schema problem)

On 2/8/17 5:54 PM, Tom Lane wrote:

Although ... actually, that may not be the bottleneck for what you're
after. The issue here is not so much discovering a clever plan for
a union as realizing that the query could be cast as a union in the
first place.

Right; their current workaround is to manually write a UNION. That's
*significantly* faster than the OR.

Maybe it'd be better to imagine this as something closer to planagg.c,
that is it knows how to apply a specific high-level optimization that
might or might not be a win, so it builds a path describing that and sees
if it looks cheaper than a path done the normal way. The fact that
we could even build a path describing a union is something that wasn't
there before 9.6, but maybe there's enough infrastructure for that now.

It's encouraging that there's at least a template to follow there... If
there is missing infrastructure, is it likely to be major? My uninformed
guess would be that the pathification was the major hurdle.

There's another transform using arrays that's possible as well (see
attached example); I believe that would work regardless of uniqueness.

That doesn't look terribly promising for automated application.
And I think it's really dependent on the exact shape of the OR
clause, which is an unpleasant limitation. Considering the amount

Not sure what you mean by shape; do you mean whether the OR conditions
are rels vs Consts or Vars?

of work this will take to do at all, you'd want it to be pretty
general I think. I'm imagining something that would look for an
OR in which each clause referenced only one rel, then if it can
identify a way to re-unique-ify the result, split into a UNION
with an arm for each rel used in the OR. The nature of the
conditions in each OR arm don't seem to matter ... though probably
you'd have to insist on there not being volatile conditions anywhere
in the query, because they'd get evaluated too many times.

In my experience, the common use case here is a large fact table that
joins to a number of dimension tables. If you can filter the large fact
table down to a fairly small size, those joins don't matter much. But if
a big chunk of your filter comes from the joins, you're in trouble.

UNION isn't really a great way to solve this problem because you still
end up doing a full join just to run the filter, but really all that you
need are the values from the dimension table(s) that you need for the
filter. IOW, change

WHERE t1 IN ('a','b') OR t2 IN ('c','d')

into

WHERE f1 IN (1,2) OR f2 IN (3,4)

(assuming a,b,c,d maps to 1,2,3,4)

BTW, there's an important caveat here: users generally do NOT want
duplicate rows from the fact table if the dimension table results aren't
unique. I thought my array solution was equivalent to what the JOINs
would do in that case but that's actually wrong. The array solution does
provide the behavior users generally want here though. JOIN is the
easiest tool to pick up for this, so it's what people gravitate to, but
I suspect most users would be happier with a construct that worked like
the array trick does, but was easier to accomplish.

I wonder if any other databases have come up with non-standard syntax to
do this.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Claudio Freire
klaussfreire@gmail.com
In reply to: Jim Nasby (#3)
Re: Improve OR conditions on joined columns (common star schema problem)

On Thu, Feb 9, 2017 at 9:50 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

WHERE t1 IN ('a','b') OR t2 IN ('c','d')

into

WHERE f1 IN (1,2) OR f2 IN (3,4)

(assuming a,b,c,d maps to 1,2,3,4)

BTW, there's an important caveat here: users generally do NOT want duplicate
rows from the fact table if the dimension table results aren't unique. I
thought my array solution was equivalent to what the JOINs would do in that
case but that's actually wrong. The array solution does provide the behavior
users generally want here though. JOIN is the easiest tool to pick up for
this, so it's what people gravitate to, but I suspect most users would be
happier with a construct that worked like the array trick does, but was
easier to accomplish.

I wonder if any other databases have come up with non-standard syntax to do
this.

What I've been doing is do those transforms (tn -> fn) in application
code. While it's a chore, the improvement in plans is usually well
worth the trouble.

IF there's a FK between fact and dimension tables, you can be certain
the transform will yield equivalent results, becuase you'll be certain
the joins don't duplicate rows.

So the transform should be rather general and useful.

If you have a join of the form:

a join b on a.f1 = b.id

Where a.f1 has an FK referencing b.id, and a filter on b X of any
form, you can turn the plan into:

with b_ids as (select id from b where X)
...
a join b on a.f1 = b.id and a.f1 in (select id from b_ids)

In order to be useful, the expected row count from b_ids should be rather small.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#3)
Re: Improve OR conditions on joined columns (common star schema problem)

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

On 2/8/17 5:54 PM, Tom Lane wrote:

Maybe it'd be better to imagine this as something closer to planagg.c,
that is it knows how to apply a specific high-level optimization that
might or might not be a win, so it builds a path describing that and sees
if it looks cheaper than a path done the normal way. The fact that
we could even build a path describing a union is something that wasn't
there before 9.6, but maybe there's enough infrastructure for that now.

It's encouraging that there's at least a template to follow there... If
there is missing infrastructure, is it likely to be major? My uninformed
guess would be that the pathification was the major hurdle.

I wrote a POC patch for this on a long airplane ride. It's not complete,
and I'm sure there are bugs as well, but it makes your example case
better. What I did about the de-duplication issue is to de-dup using
the CTIDs of all baserels in the query. This means the optimization
is only applicable to cases where all the rels have CTIDs ... but other
methods such as inspecting unique keys ain't gonna work for non-table
rels either, so I think this is about the best we can hope for.
However, I did not understand your point about:

BTW, there's an important caveat here: users generally do NOT want
duplicate rows from the fact table if the dimension table results aren't
unique.

so maybe we should be thinking about some other way entirely?

Anyway, what's attached below produces this on your example query:

Aggregate (cost=38.12..38.13 rows=1 width=8) (actual time=0.158..0.158 rows=1 loops=1)
-> Result (cost=38.03..38.11 rows=4 width=0) (actual time=0.151..0.155 rows=3 loops=1)
-> Unique (cost=38.03..38.07 rows=4 width=18) (actual time=0.150..0.154 rows=3 loops=1)
-> Sort (cost=38.03..38.04 rows=4 width=18) (actual time=0.150..0.151 rows=4 loops=1)
Sort Key: f.ctid, d1.ctid, d2.ctid
Sort Method: quicksort Memory: 25kB
-> Append (cost=4.85..37.99 rows=4 width=18) (actual time=0.070..0.106 rows=4 loops=1)
-> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.069..0.075 rows=2 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.035..0.038 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d1 (cost=0.28..8.29 rows=1 width=10) (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.023..0.025 rows=2 loops=1)
Recheck Cond: (f1 = d1.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f1 (cost=0.00..4.29 rows=2 width=0) (actual time=0.016..0.016 rows=2 loops=1)
Index Cond: (f1 = d1.s)
-> Index Scan using dim_pkey on dim d2 (cost=0.28..0.31 rows=1 width=10) (actual time=0.016..0.017 rows=0 loops=2)
Index Cond: (f.f2 = s)
-> Nested Loop Left Join (cost=4.85..19.00 rows=2 width=18) (actual time=0.025..0.029 rows=2 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=16) (actual time=0.022..0.025 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d2 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.017..0.019 rows=2 loops=1)
Recheck Cond: (f2 = d2.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f2 (cost=0.00..4.29 rows=2 width=0) (actual time=0.014..0.014 rows=2 loops=1)
Index Cond: (f2 = d2.s)
-> Index Scan using dim_pkey on dim d1 (cost=0.28..0.31 rows=1 width=10) (actual time=0.001..0.001 rows=0 loops=2)
Index Cond: (f.f1 = s)

The main remaining piece of work here is that, as you can see from the
above, it fails to eliminate joins to tables that we don't actually need
in a particular UNION arm. This is because the references to those
tables' ctid columns prevent analyzejoins.c from removing the joins.
I've thought about ways to deal with that but haven't come up with
anything that wasn't pretty ugly and/or invasive.

regards, tom lane

Attachments:

join-or-to-union-1.patchtext/x-diff; name=join-or-to-union-1.patchDownload+622-2
#6David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#5)
Re: Improve OR conditions on joined columns (common star schema problem)

On 12 February 2017 at 13:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote a POC patch for this on a long airplane ride. It's not complete,
and I'm sure there are bugs as well, but it makes your example case
better. What I did about the de-duplication issue is to de-dup using
the CTIDs of all baserels in the query. This means the optimization
is only applicable to cases where all the rels have CTIDs ... but other
methods such as inspecting unique keys ain't gonna work for non-table
rels either, so I think this is about the best we can hope for.
However, I did not understand your point about:

This is very interesting. Couldn't this be even more generic and
instead of looking at just the jointree quals, also look at the join
conditions too, as I think you can use this to also transforms queries
such as:

select * from t1 inner join t2 on t1.a = t2.a or t1.b = t2.b;

I imagine you'd also want an enable_unionor GUC which can be used to
disable this for when the planner makes a poor choice.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#6)
Re: Improve OR conditions on joined columns (common star schema problem)

David Rowley <david.rowley@2ndquadrant.com> writes:

This is very interesting. Couldn't this be even more generic and
instead of looking at just the jointree quals, also look at the join
conditions too, as I think you can use this to also transforms queries
such as:

The hard part of that is to remember exactly where you need to do surgery
on the querytree to replace the OR clause, keeping in mind that a tree
copy step is going to happen in between first searching for the OR and
doing the replacement. I'm sure it's soluble but it would've involved
more complexity than I felt justified for a POC.

I imagine you'd also want an enable_unionor GUC which can be used to
disable this for when the planner makes a poor choice.

It's not so much poor choices as the cost of the optimization attempt ---
if there's a K-relation OR clause, this will increase the cost of planning
by a factor approaching K+1, whether or not you get a better plan out of
it. I ran the regression tests with some instrumentation and determined
that this logic fires a dozen or two times, and fails to produce a plan
that looks cheaper than the standard plan in any of those cases. So if we
go down this road, not only do we need a GUC but I suspect it had better
default to off; only people using star schemas are really likely to get a
win out of it.

Maybe we could improve that situation by applying some heuristics that
prevent the optimization from being attempted unless it's more likely to
produce a win. But I'm not sure what that would look like. We have to
make the decision pretty early and without a lot of information. As an
example, Jim's case fails completely if we don't do the transformation
before reduce_outer_joins, because one of the things that *has* to happen
to get a desirable plan is to recognize that the extracted restriction
clause allows the left join to the dimension table to be simplified to an
inner join.

Having written the POC, I find myself not terribly satisfied with it.
It seems brute-force and not at all integrated with the rest of the
planner. Still, asking for integration with the existing UNION planning
is probably silly, since that really needs to be thrown away and rewritten.
Maybe this is a reasonable way forward until we get a clearer view of
what that area needs to look like.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#7)
Re: Improve OR conditions on joined columns (common star schema problem)

On 13 February 2017 at 06:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:

It's not so much poor choices as the cost of the optimization attempt ---
if there's a K-relation OR clause, this will increase the cost of planning
by a factor approaching K+1, whether or not you get a better plan out of
it. I ran the regression tests with some instrumentation and determined
that this logic fires a dozen or two times, and fails to produce a plan
that looks cheaper than the standard plan in any of those cases. So if we
go down this road, not only do we need a GUC but I suspect it had better
default to off; only people using star schemas are really likely to get a
win out of it.

I always try to shy away from assuming that the regression test suite
is a good reflection of a real world set of queries. It's full of
tests for edge cases that are rarely seen in reality. FWIW I did a
similar experiment with unique joins and was disappointed to see that
it didn't apply in more cases. Yet I've worked with OLTP applications
since 2005, and I struggle to recall any many:many joins at all.

Perhaps this optimisation is a candidate for only being applied when
some sort of planner_strength GUC (as mentioned in FOSDEM developer
meeting in 2016) reaches some threshold. There's certainly already
some planner smarts that can be skipped when such a GUC is set to a
lower level (e.g join removal). We could likely save many cycles if we
had the ability to re-plan queries where total_cost > X with more
smarts enabled.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: David Rowley (#8)
Re: Improve OR conditions on joined columns (common star schema problem)

On 2/12/17 5:06 PM, David Rowley wrote:

Yet I've worked with OLTP applications
since 2005, and I struggle to recall any many:many joins at all.

Interesting... I've run across it numerous times. In any case, for OLTP
there's other things you can do fairly easily.

Perhaps this optimisation is a candidate for only being applied when
some sort of planner_strength GUC (as mentioned in FOSDEM developer
meeting in 2016) reaches some threshold. There's certainly already
some planner smarts that can be skipped when such a GUC is set to a
lower level (e.g join removal). We could likely save many cycles if we
had the ability to re-plan queries where total_cost > X with more
smarts enabled.

Yeah, I strongly suspect some kind of "multi-stage" planner would be a
big win.

As for the POC, that's the same kind of plan I'm seeing IRL: a nested
loop gets used essentially to do the lookup of dimension text to
dimension ID.

I'm wondering if there's any tricks that could be applied on the sort
since it's dealing with CTIDs.

I do think it'd be even better if we had the ability to do that lookup
as part of planning, so you could discover the exact stats for the
relevant ID values, but that's even more involved.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: Improve OR conditions on joined columns (common star schema problem)

I wrote:

The main remaining piece of work here is that, as you can see from the
above, it fails to eliminate joins to tables that we don't actually need
in a particular UNION arm. This is because the references to those
tables' ctid columns prevent analyzejoins.c from removing the joins.
I've thought about ways to deal with that but haven't come up with
anything that wasn't pretty ugly and/or invasive.

I thought of a way that wasn't too awful, which was to inject the requests
for CTID columns only after we've done join removal. The attached v2
patch produces this for your test case:

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=36.84..36.85 rows=1 width=8) (actual time=0.075..0.075 rows=1 loops=1)
-> Result (cost=36.77..36.83 rows=4 width=0) (actual time=0.069..0.073 rows=3 loops=1)
-> Unique (cost=36.77..36.79 rows=4 width=6) (actual time=0.069..0.072 rows=3 loops=1)
-> Sort (cost=36.77..36.78 rows=4 width=6) (actual time=0.068..0.069 rows=4 loops=1)
Sort Key: f.ctid
Sort Method: quicksort Memory: 25kB
-> Append (cost=4.57..36.73 rows=4 width=6) (actual time=0.018..0.033 rows=4 loops=1)
-> Nested Loop (cost=4.57..18.37 rows=2 width=6) (actual time=0.018..0.020 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d1 (cost=0.28..8.29 rows=1 width=10) (actual time=0.005..0.005 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.010..0.012 rows=2 loops=1)
Recheck Cond: (f1 = d1.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f1 (cost=0.00..4.29 rows=2 width=0) (actual time=0.006..0.006 rows=2 loops=1)
Index Cond: (f1 = d1.s)
-> Nested Loop (cost=4.57..18.37 rows=2 width=6) (actual time=0.009..0.011 rows=2 loops=1)
-> Index Scan using dim_t_key on dim d2 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
Index Cond: ('1'::text = t)
-> Bitmap Heap Scan on fact f (cost=4.30..10.05 rows=2 width=14) (actual time=0.005..0.006 rows=2 loops=1)
Recheck Cond: (f2 = d2.s)
Heap Blocks: exact=2
-> Bitmap Index Scan on f_f2 (cost=0.00..4.29 rows=2 width=0) (actual time=0.003..0.003 rows=2 loops=1)
Index Cond: (f2 = d2.s)
Planning time: 0.732 ms
Execution time: 0.156 ms
(25 rows)

I think this might be code-complete, modulo the question of whether we
want an enabling GUC for it. I'm still concerned about the question
of whether it adds more planning time than it's worth for most users.
(Obviously it needs some regression test cases too, and a lot more
real-world testing than I've given it.)

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

Potential future work includes finding join ORs in upper-level INNER JOIN
ON clauses, and improving matters so we can do the unique-ification with
hashing as well as sorting. I don't think either of those things has to
be in the first commit, though.

regards, tom lane

Attachments:

join-or-to-union-2.patchtext/x-diff; charset=us-ascii; name=join-or-to-union-2.patchDownload+708-2
#11Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#10)
Re: Improve OR conditions on joined columns (common star schema problem)

On 2/14/17 1:18 PM, Tom Lane wrote:

I think this might be code-complete, modulo the question of whether we
want an enabling GUC for it. I'm still concerned about the question
of whether it adds more planning time than it's worth for most users.
(Obviously it needs some regression test cases too, and a lot more
real-world testing than I've given it.)

Don't we normally provide a GUC for this level of query manipulation? We
can always remove it later if evidence shows there's not sufficient
downside (perhaps after gaining the ability for the planner to do extra
work on queries that appear to be large enough to warrant it).

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

Another reason for a GUC...

I'll put some thought into it and see if I can find any holes. Are you
only worried about the removal of "useless" rels or is there more?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#11)
Re: Improve OR conditions on joined columns (common star schema problem)

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

On 2/14/17 1:18 PM, Tom Lane wrote:

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

I'll put some thought into it and see if I can find any holes. Are you
only worried about the removal of "useless" rels or is there more?

Well, the key point is whether it's really OK to de-dup on the basis
of only the CTIDs that are not eliminated in any UNION arm. I was
feeling fairly good about that until I thought of the full-join-to-
left-join-to-no-join conversion issue mentioned in the comment.
Now I'm wondering if there are other holes; or maybe I'm wrong about
that one and it's not necessary to be afraid of full joins.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13David Steele
david@pgmasters.net
In reply to: Tom Lane (#12)
Re: Improve OR conditions on joined columns (common star schema problem)

On 2/14/17 4:03 PM, Tom Lane wrote:

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

On 2/14/17 1:18 PM, Tom Lane wrote:

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

I'll put some thought into it and see if I can find any holes. Are you
only worried about the removal of "useless" rels or is there more?

Well, the key point is whether it's really OK to de-dup on the basis
of only the CTIDs that are not eliminated in any UNION arm. I was
feeling fairly good about that until I thought of the full-join-to-
left-join-to-no-join conversion issue mentioned in the comment.
Now I'm wondering if there are other holes; or maybe I'm wrong about
that one and it's not necessary to be afraid of full joins.

This patch applies cleanly (with offsets) and compiles at cccbdde.

Jim, have you had time to think about this? Any insights?

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: David Steele (#13)
Re: Improve OR conditions on joined columns (common star schema problem)

On 3/16/17 11:54 AM, David Steele wrote:

On 2/14/17 4:03 PM, Tom Lane wrote:

Jim Nasby <Jim.Nasby@BlueTreble.com> writes:

On 2/14/17 1:18 PM, Tom Lane wrote:

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

I'll put some thought into it and see if I can find any holes. Are you
only worried about the removal of "useless" rels or is there more?

Well, the key point is whether it's really OK to de-dup on the basis
of only the CTIDs that are not eliminated in any UNION arm. I was
feeling fairly good about that until I thought of the full-join-to-
left-join-to-no-join conversion issue mentioned in the comment.
Now I'm wondering if there are other holes; or maybe I'm wrong about
that one and it's not necessary to be afraid of full joins.

This patch applies cleanly (with offsets) and compiles at cccbdde.

Jim, have you had time to think about this? Any insights?

Having thought about it, I share Tom's concerns. And now I'm worried
about what happens if there are multiple separate OR clauses. I guess
those would be handled by separate UNIONs?

I'm also finding it a bit hard to follow the comment Tom mentioned. I'm
pretty sure I understand what's going on until this part:

The identical proof can be expected to apply
+  * in other arms, except in an arm that references that rel in its version
+  * of the OR clause.  But in such an arm, we have effectively added a
+  * restriction clause to what is known in other arms, which means that the
+  * set of rows output by that rel can't increase compared to other arms.

AIUI, this is describing a case something like this:

SELECT child.blah FROM child LEFT JOIN parent USING(parent_id)
WHERE child.foo AND (child.baz=1 or child.baz=2)

given that parent.parent_id is unique. Except for these concerns, there
would need to be a complex OR somewhere in here that sometimes
referenced parent and sometimes didn't, such as

WHERE child.foo AND (child.baz=1 OR parent.foo=3)

But I'm not following the logic here (very possibly because I'm wrong
about what I said above):

+  * Therefore the situation in such an arm must be that including the rel
+  * could result in either zero or one output row, rather than exactly one
+  * output row as in other arms.  So we still don't need to consider it for
+  * de-duplication.

I'm definitely not certain about the rest of it.

Tom, could you expand the description some, especially with some examples?
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#14)
Re: Re: Improve OR conditions on joined columns (common star schema problem)

Jim Nasby <jim.nasby@openscg.com> writes:

On 3/16/17 11:54 AM, David Steele wrote:

On 2/14/17 4:03 PM, Tom Lane wrote:

Well, the key point is whether it's really OK to de-dup on the basis
of only the CTIDs that are not eliminated in any UNION arm. I was
feeling fairly good about that until I thought of the full-join-to-
left-join-to-no-join conversion issue mentioned in the comment.

Jim, have you had time to think about this? Any insights?

Having thought about it, I share Tom's concerns. And now I'm worried
about what happens if there are multiple separate OR clauses. I guess
those would be handled by separate UNIONs?

As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.

Tom, could you expand the description some, especially with some examples?

Here's a counterexample --- admittedly rather artificial, but still a
counterexample --- to applying the optimization when there are FULL joins.
Consider

SELECT count(*)
FROM a FULL JOIN b ON (a.id = b.id)
WHERE (a.x = 42 OR b.y = 43);

and suppose that a and b have mutual FK constraints guaranteeing that
every non-null a.id value has exactly one match in b and vice versa. (For
the purposes of this example, I think it doesn't matter whether or not we
allow there to be null id values.) Also suppose that there are some join
rows satisfying both a.x = 42 and b.y = 43, while other join rows satisfy
only one of the OR arms. The patch would try to implement this query as,
approximately,

SELECT count(*)
FROM
( SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE a.x = 42
UNION-using-ctids
SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE b.y = 43 )

Now in the first UNION arm, we'd observe that the strict a.x = 42
condition allows reduction of the join strength to LEFT JOIN, and then
we'd deduce from the FK on b that the join to b can be dropped altogether.
In the second arm, we'd similarly conclude that a can be dropped
altogether, leaving

SELECT count(*)
FROM
( SELECT FROM a WHERE a.x = 42
UNION-using-ctids
SELECT FROM b WHERE b.y = 43 )

But at this point there are no rels that are not eliminated in any UNION
arm, so that no de-duping would occur in the UNION, meaning that we'd
double-count any join rows in which both a.x = 42 and b.y = 43.

I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely. But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication. To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.

So full joins definitely break this whole optimization. I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step. If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm. The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion. In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row. That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.

Any clearer yet?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#15)
Re: Re: Improve OR conditions on joined columns (common star schema problem)

I wrote:

Consider
SELECT count(*)
FROM a FULL JOIN b ON (a.id = b.id)
WHERE (a.x = 42 OR b.y = 43);
and suppose that a and b have mutual FK constraints guaranteeing that
every non-null a.id value has exactly one match in b and vice versa.

Oh, that was sloppy of me. Join removal depends on unique constraints
not FK constraints. So actually, this example just requires assuming
that both a.id and b.id are known unique, which is much less far-fetched
than assuming circular FK constraints.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#15)
Re: Re: Improve OR conditions on joined columns (common star schema problem)

On 3/19/17 2:32 PM, Tom Lane wrote:

Jim Nasby <jim.nasby@openscg.com> writes:

Having thought about it, I share Tom's concerns. And now I'm worried
about what happens if there are multiple separate OR clauses. I guess
those would be handled by separate UNIONs?

As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.

Agreed.

I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely. But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication. To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.

It might still be worth-while in some circumstances. In your example, if
there were these indexes:

a__id ON a(id), a__x ON a(x)
b__id ON b(id), b__y ON b(y)

then it might be faster to nested loop a__x=42 to b__id=a.id and union
that with b__y=43 nested to a__id=b.id.

That said, now isn't the time to be adding more complexity.

So full joins definitely break this whole optimization. I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step. If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm. The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion. In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row. That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.

The only case I can think of is: would it be possible (perhaps not
today; maybe in the future) for other parts of the query to affect join
elimination? I can't conceive of how that might happen, but if it did
then it's possible that the elimination would work differently with the
UNION than it would with an OR.

The comment on join_is_removable() does mention that there's other
potentially interesting cases that we can't handle now; it's maybe worth
mentioning

Other than that, I can't see any issues with your logic.

Any clearer yet?

Absolutely. I think it would be very valuable to include that with the
initial comment in planunionor.c. Join reduction and removal is already
tricky enough to wrap your head around.

Other comments:

+ * is retty mechanical, but we can't do it until we have a RelOptInfo for the

s/retty/pretty/

I suspect that in many systems single-table queries are far more common
than CTEs, so maybe it's worth reversing those two tests in
split_join_or_clauses().

For the Unique path, it would be nice if the code did what would be
necessary to consider a TID hash join, since that means a user could
create the appropriate operator and it would just be picked up.
Certainly not worth much effort at this point though.

+ 	/*
+ 	 * Must not have any volatile functions in FROM or WHERE (see notes at
+ 	 * head of file).
+ 	 */
+ 	if (contain_volatile_functions((Node *) parse->jointree))

Is there by chance anywhere else that needs to check that? Maybe worth
adding the info to the Query struct if so.

+ * We insist that all baserels used in the query be plain relations, so

Dumb question... views have already be decomposed at this point, right?

Perhaps less dumb question... what happens if the original query already
had setOps? AIUI setOps work has already been done by the time
split_join_or_clauses() happens; I just want to check that that's OK.

I'm not sure a GUC is worth it... I suspect that any query with multiple
rels and an OR condition is going to be expensive enough that whatever
additional plan time there is won't be noticeable.

I've verified that the patch still applies and make check-world is clean.
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#10)
Re: Improve OR conditions on joined columns (common star schema problem)

Hi,

On 2017-02-14 14:18:54 -0500, Tom Lane wrote:

I think this might be code-complete, modulo the question of whether we
want an enabling GUC for it. I'm still concerned about the question
of whether it adds more planning time than it's worth for most users.
(Obviously it needs some regression test cases too, and a lot more
real-world testing than I've given it.)

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

Potential future work includes finding join ORs in upper-level INNER JOIN
ON clauses, and improving matters so we can do the unique-ification with
hashing as well as sorting. I don't think either of those things has to
be in the first commit, though.

Are you planning to push this into v10? Given the looming freeze I'm
inclined to bump this to the next CF.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#18)
Re: Improve OR conditions on joined columns (common star schema problem)

Andres Freund <andres@anarazel.de> writes:

On 2017-02-14 14:18:54 -0500, Tom Lane wrote:

One point that could use further review is whether the de-duplication
algorithm is actually correct. I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

Are you planning to push this into v10? Given the looming freeze I'm
inclined to bump this to the next CF.

I'm still not fully convinced the patch is correct, so I have no problem
with bumping it out.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#17)
Re: Re: Improve OR conditions on joined columns (common star schema problem)

Jim Nasby <jim.nasby@openscg.com> writes:

I've verified that the patch still applies and make check-world is clean.

Not any more :-(. Here's a v3 rebased over HEAD. No substantive
change from v2.

regards, tom lane

Attachments:

join-or-to-union-3.patchtext/x-diff; charset=us-ascii; name=join-or-to-union-3.patchDownload+709-2
#21Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#20)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#20)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#25)
In reply to: Tom Lane (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
#29David Rowley
dgrowleyml@gmail.com
In reply to: Tom Lane (#24)
#30David Rowley
dgrowleyml@gmail.com
In reply to: Andres Freund (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#22)
In reply to: Tom Lane (#31)
In reply to: Peter Geoghegan (#32)
#34Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#31)
#35Michael Paquier
michael@paquier.xyz
In reply to: Noah Misch (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#35)
#37Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#37)
#39Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#38)
#40Andy Fan
zhihui.fan1213@gmail.com
In reply to: Noah Misch (#39)