Plan to support predicate push-down into subqueries with aggregates?
Hi all.
I was wondering if there were any plans to support predicate push-down
optimization for subqueries (and views) with aggregates?
I was recently bitten by this, as I had incorrectly assumed that this was an
optimization that was in place, and designed quite a bit around that
assumption, only to get hit with terrible performance when more data got
loaded into the system.
For now I've solved the issue with aggregate tables which store the
data, maintained by triggers on IUD. This gets messy quickly, as I have
some aggregates which depend on 5-6 other tables. I'd
love to be able to just store the logic for calculating aggregates in a
view, and use that at query time instead of having to deal with it like
this.
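To sketch the kind of trigger maintenance I mean (names here are
illustrative, the real dependencies span several tables, and this
simplified version doesn't handle deleting the last detail row):

CREATE TABLE header_total_tbl (
    header_id integer PRIMARY KEY,
    amount    numeric NOT NULL DEFAULT 0
);

CREATE FUNCTION refresh_header_total() RETURNS trigger AS $$
DECLARE
    v_id integer;
BEGIN
    -- OLD is only defined for UPDATE/DELETE, NEW for INSERT/UPDATE
    IF TG_OP = 'DELETE' THEN
        v_id := OLD.header_id;
    ELSE
        v_id := NEW.header_id;
    END IF;

    -- Recompute just the affected header's total and upsert it
    INSERT INTO header_total_tbl (header_id, amount)
    SELECT d.header_id, sum(d.rate * d.quantity)
    FROM detail_1 d
    WHERE d.header_id = v_id
    GROUP BY d.header_id
    ON CONFLICT (header_id) DO UPDATE SET amount = EXCLUDED.amount;

    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER detail_1_iud
    AFTER INSERT OR UPDATE OR DELETE ON detail_1
    FOR EACH ROW EXECUTE PROCEDURE refresh_header_total();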
I have written a test case script that explains the type of queries I am
talking about, and the issues. It's heavily simplified compared to a real
system, but it'll do. Please see attached.
Thanks,
-Adam
Attachments:
subquery_pushdown.txt (text/plain; charset=US-ASCII)
Adam Brusselback <adambrusselback@gmail.com> writes:
I was wondering if there were any plans to support predicate push-down
optimization for subqueries (and views) with aggregates?
Right offhand I would say that that's a complete mischaracterization
of the problem. I've not tried to run your test case, but you say
--Quick, is able to push down because it's a simple equality check
SELECT *
FROM header
INNER JOIN header_total
USING (header_id)
WHERE header.header_id = 26;

--Slow, no pushdown
SELECT *
FROM header
INNER JOIN header_total
USING (header_id)
WHERE header.header_id < 200;

--Slow, no pushdown
SELECT *
FROM header
INNER JOIN header_total
USING (header_id)
WHERE header.description like '%5%';
There's no preference for equalities over other kinds of predicates
as far as subquery pushdown is concerned.  I think your real
problem is that in the first case, the system will derive the
additional condition "header_total.header_id = 26", while in the
second case it will not deduce "header_total.header_id < 200".
That's because the machinery for deducing such implied constraints
works only with equalities. That's not very likely to change anytime
soon, and even if it did, the inference would only extend to operators
that are members of the same btree family as the join equality operator.
Your example with a LIKE clause is always going to be out in the cold,
because there is no principled basis for the planner to decide that
"a = b" means that "a LIKE x" and "b LIKE x" will give the same result.
It hasn't got enough information about the behavior of LIKE to know
if that's safe or not. (It does, on the other hand, know very well that
SQL equality operators don't necessarily guarantee bitwise identity.)
So I'd suggest just modifying your queries to write out both constraints
explicitly.
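For instance (untested), repeating the range condition on both sides of
the join:

SELECT *
FROM header
INNER JOIN header_total
USING (header_id)
WHERE header.header_id < 200
  AND header_total.header_id < 200;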
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Thanks Tom, appreciate the reply.
Sorry if I didn't call it the correct thing. I just know that with trying
to encapsulate this aggregate logic in a view, I am unable to use that view
in a query that I know is only going to touch a subset of the data without
incurring a performance hit from the view doing seq scans on all of the
rows in the detail_1 and detail_2 tables, and then throwing out 99% of the
results when the filter is applied.
I had initially started creating functions that would take an array of ids
as a parameter, and manually push them down in the subqueries. That got
really really messy though, and we moved away from doing that to having the
aggregates eagerly materialized to a table with triggers.
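(For reference, the abandoned function approach looked roughly like this
sketch; the names are illustrative:)

CREATE FUNCTION header_total_for(p_header_ids integer[])
RETURNS TABLE (header_id integer, amount numeric) AS $$
    -- Aggregate only the requested headers; the id list is "pushed
    -- down" by hand instead of by the planner
    SELECT d.header_id, sum(d.rate * d.quantity)
    FROM detail_1 d
    WHERE d.header_id = ANY (p_header_ids)
    GROUP BY d.header_id;
$$ LANGUAGE sql STABLE;

-- e.g. SELECT * FROM header_total_for(ARRAY[25, 26]);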
Are there any other options for making this type of query faster? It could
be that I just am totally missing a better way to do this. I do really
want to be able to contain that logic within a view of some sort though, as
a bunch of other stuff is built on top of that. Having to push that
aggregate query into all of those other queries would be hell.
Thanks,
-Adam
On Tue, Mar 8, 2016 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
We bottom (or inline) post on these lists.
On Tue, Mar 8, 2016 at 3:37 PM, Adam Brusselback <adambrusselback@gmail.com>
wrote:
Maybe something like:
CREATE TYPE header_total_item AS (
    amount numeric,
    count integer
);

CREATE VIEW header_total AS
SELECT header_id, header_amount,
    COALESCE(
        (SELECT ROW(sum(rate * quantity), count(detail_1_id))::header_total_item
         FROM detail_1 WHERE detail_1.header_id = header.header_id),
        ROW(0.000, 0)::header_total_item
    ) AS detail_1_total_item
FROM header;
Doesn't solve the "balance" column without going into a sub-query...which
I suspect is possible but don't have time to test.
I do understand the question as to why the view cannot be re-written as:
I don't have time to get into this deeper right now.  The main question is
basically how you can force a parameterized nested loop join on the LEFT
JOIN sub-queries, given that the system has equality joins between them and
header, and you've supposedly provided a very selective predicate for the
rows being returned by header.
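(One way to get a parameterized inner scan today, sketched untested
against the test schema, is a LATERAL subquery, which the planner
computes per header row:)

SELECT h.header_id,
       t1.amount_1,
       t1.detail_1_count
FROM header h
CROSS JOIN LATERAL (
    -- correlated aggregate: one row per header, driven by an index
    -- lookup on detail_1.header_id
    SELECT sum(d.rate * d.quantity) AS amount_1,
           count(d.detail_1_id)     AS detail_1_count
    FROM detail_1 d
    WHERE d.header_id = h.header_id
) t1
WHERE h.header_id < 200;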
David J.
I responded yesterday, but it seems to have gotten caught up because it was
too big with the attachments... Here it is again.
Sorry about not posting correctly, hopefully I did it right this time.
So I wanted to see if Sql Server (2014) could handle this type of query
differently than Postgres (9.5.1), so I got an instance of SQL Server
Express installed and ported the test script to it.
I updated my Postgres script so the data is the same in each server. The
end result is Sql Server seems to be able to optimize all of these queries
MUCH better than Postgres.
I disabled parallelism in Sql Server to make the comparison fair.
I've attached the explain analyze results for Postgres, and the execution
plan for Sql Server (in picture form... don't know a better way)
Results are:
--Sql Server: 15ms average
--Postgres: 6ms average
SELECT *
FROM header
INNER JOIN header_total
ON header.header_id = header_total.header_id
WHERE header.header_id = 26;
--Sql Server: 15ms average
--Postgres: 1250ms average
SELECT *
FROM header
INNER JOIN header_total
ON header.header_id = header_total.header_id
WHERE header.header_id < 27
AND header.header_id > 24;
--Sql Server: 567ms average
--Postgres: 1265ms average
SELECT *
FROM header
INNER JOIN header_total
ON header.header_id = header_total.header_id
WHERE header.description like '%5%';
--Sql Server: 15ms average
--Postgres: 1252ms average
SELECT *
FROM header_total
WHERE header_total.header_id IN (
SELECT header_id
FROM header
WHERE header.header_id < 27
AND header.header_id > 24);
Here are the sql server execution plans as links rather than attachments:
https://drive.google.com/file/d/0BzWRjbj6CQLeb29JZ0lMMnp4QTA/view?usp=sharing
https://drive.google.com/file/d/0BzWRjbj6CQLeM2t0MmZDdE03OHc/view?usp=sharing
https://drive.google.com/file/d/0BzWRjbj6CQLeV0hjRmM5NE9CTWc/view?usp=sharing
https://drive.google.com/file/d/0BzWRjbj6CQLeNmdlQWpHYU1BVHM/view?usp=sharing
On Wed, 2016-03-09 at 12:25 -0500, Adam Brusselback wrote:
Hello Adam,
I can't believe that something originating from the dark side can do
things better. So I went thru your test case. Running this on my trusty
Compaq Presario with 5GB of RAM and using Postgres 9.5.1.
1) Created your tables but re-named the column "description" as
"header_description". header has 1,000 rows, detail_1 has 400,391 rows
and detail_2 has 3,451 rows.
2) altered your index definitions to make them unique by including
their primary keys. (I have a penchant for unique indexes).
3) Ran the ANALYZE.
4) Shut down the database so as to clear the cache.
5) A while later started the database and created this view:-
CREATE OR REPLACE VIEW header_total AS
SELECT header.header_id, header.header_description,
       amount_1, detail_1_count,
       amount_2, detail_2_count
FROM header
LEFT JOIN (
    SELECT header_id, SUM(rate * quantity) AS amount_1,
           COUNT(detail_1_id) AS detail_1_count
    FROM detail_1
    GROUP BY detail_1.header_id
) detail_1 ON header.header_id = detail_1.header_id
LEFT JOIN (
    SELECT header_id, SUM(amount) AS amount_2,
           COUNT(detail_2_id) AS detail_2_count
    FROM detail_2
    GROUP BY detail_2.header_id
) detail_2 ON header.header_id = detail_2.header_id;
Note that:- (a) I included header_description in the definition.
(b) Removed some lines as if you want the total you may as well
include it in your select from the view, and for the life of me I
couldn't understand the purpose of:-
header.amount = coalesce(detail_1.amount, 0) +
coalesce(detail_2.amount, 0) as balanced
Is "balanced" supposed to be a boolean?
If you need header.amount include it in the view.
6) Ran your three queries and here are the timings from the log:-
2016-03-10 13:07:47 AEDTLOG: duration: 0.221 ms parse <unnamed>:
SELECT * FROM header_total WHERE header_id = 26
2016-03-10 13:07:47 AEDTLOG: duration: 0.551 ms bind <unnamed>:
SELECT * FROM header_total WHERE header_id = 26
2016-03-10 13:07:47 AEDTLOG: duration: 1.103 ms execute <unnamed>:
SELECT * FROM header_total WHERE header_id = 26
2016-03-10 13:07:54 AEDTLOG: duration: 0.180 ms parse <unnamed>:
SELECT * FROM header_total WHERE header_id < 200
2016-03-10 13:07:54 AEDTLOG: duration: 0.481 ms bind <unnamed>:
SELECT * FROM header_total WHERE header_id < 200
2016-03-10 13:07:55 AEDTLOG: duration: 458.418 ms execute <unnamed>:
SELECT * FROM header_total WHERE header_id < 200
2016-03-10 13:08:01 AEDTLOG: duration: 0.230 ms parse <unnamed>:
SELECT * FROM header_total WHERE header_description like '%5%'
2016-03-10 13:08:01 AEDTLOG: duration: 0.542 ms bind <unnamed>:
SELECT * FROM header_total WHERE header_description like '%5%'
2016-03-10 13:08:01 AEDTLOG: duration: 459.346 ms execute <unnamed>:
SELECT * FROM header_total WHERE header_description like '%5%'
I don't believe that (0.230 ms + 0.542 ms + 459.346 ms) could be
described as "slow" when it returns 271 rows.
Obviously it would help if there were more details about your
application. Also creating the test data via those bulk inserts doesn't
replicate any randomness that may occur via inserts made by an
application.
BTW, I'm a great fan of using views.
HTH,
Rob
On Wednesday, March 9, 2016, rob stone <floriparob@gmail.com> wrote:
It's basically accounting.  Does the total amount on the header
(header.amount) equal the sum of the two components (detail_1.amount +
detail_2.amount)?  It returns true or false based upon the equality.  So
not only do you need to compute detail_1.amount, but you also need to
immediately use it as part of an expression, and return both usages from
the view.
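(In view form, that check might look like this untested sketch against
the test schema:)

CREATE VIEW header_balance AS
SELECT h.header_id,
       -- true when the header amount matches the detail sums
       h.amount = coalesce(d1.amount, 0) + coalesce(d2.amount, 0) AS balanced
FROM header h
LEFT JOIN (SELECT header_id, sum(rate * quantity) AS amount
           FROM detail_1 GROUP BY header_id) d1 USING (header_id)
LEFT JOIN (SELECT header_id, sum(amount) AS amount
           FROM detail_2 GROUP BY header_id) d2 USING (header_id);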
David J.
Rob,
I understand that if I were to replicate the logic in that view for every
use case I had for those totals, this would not be an issue. But that would
very much complicate some of my queries to the point of absurdity if I
wanted to write them in a way which would push everything down properly.
The issue is, that I need that data to be able to join to that view from
other smaller sets of data and not have it calculate the totals for every
"header" I have in my system, just to throw away 99% of them.
My application is for contract and rebate management, so as David said,
basically accounting. We allow users to set up contracts to sell products
to their customers, and then track and verify the rebates they get back are
correct, and that they're not overpaying.
The equivalent of the header_total view is used in quite a few different
places in my application. In one case, the state of one object in my
system (called a deduction) is derived from 5 different data points, 2 of
which are from my equivalent of the "header total" view. Things like the
state for the deduction object are derived from a query that I encapsulated
inside a view. You can see how this proliferates.
In the end, after I switched to materialized tables with this data stored,
querying things that relied on this aggregated data got much much faster.
In my largest client's database, the query to get the deduction state for a
single deduction went from 5 seconds, down to 2ms or so. Unsurprisingly,
if I wanted the deduction state for every deduction in the system, vs only
one, before it was 5 seconds either way, as it had to aggregate all of that
detail level data no matter what, and then throw most of it away if I
wanted just a single deduction.
I would very much rather not have to use the materialized tables to get
good performance, and just use views to get this data instead. I don't
like having to have so many triggers to control the materialized tables; it
leaves too much room for inconsistent data, whereas a view querying the
underlying data directly leaves no room for inconsistency.
I understand that not everyone has the same use case as I do, but I can see
from my tests that the "dark side" does seem to be able to optimize for
it.  This doesn't seem like an uncommon use case to me (think of banking
software, with an account table and a transaction table, and having the
account balance derived in a view), and I'd love to see it supported.
I suppose there are two different ways this could go to improve my
situation: 1) better optimization for this type of query, or 2)
materialized views gaining more features, like refreshing when the
underlying data is changed, as detailed here
<https://wiki.postgresql.org/images/8/85/Materialised_Views_-_FOSDEM.pdf>
(road map part).
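(Until then, the closest built-in option seems to be a manually
refreshed materialized view, e.g.:)

CREATE MATERIALIZED VIEW header_total_mv AS
SELECT header_id,
       sum(rate * quantity) AS amount_1,
       count(detail_1_id)   AS detail_1_count
FROM detail_1
GROUP BY header_id;

-- A unique index is required for CONCURRENTLY:
CREATE UNIQUE INDEX ON header_total_mv (header_id);

-- Refresh on demand without blocking readers (9.4+):
REFRESH MATERIALIZED VIEW CONCURRENTLY header_total_mv;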
Either would be great in my book!
The feedback is much appreciated; I was just trying to see with this post
if there was any plan / ideas / anything at all in regards to this type of
use case (or better ways of doing it that I hadn't thought of).
On Thu, 2016-03-10 at 07:09 -0500, Adam Brusselback wrote:
Hello Adam,
With regards to proliferation, I guess it depends on whether you have
multiple views or multiple tables and which is best suited to your
application. If you have a table of customers all you need is a single
table of transactions where one of the columns is a foreign key
pointing to the customer. Want to know a customer's outstanding balance
-- sum their transactions. You wouldn't have a table just for invoices,
one for credit notes, one for discounts given, one for payments, etc.,
etc. The same goes if you want to know the stock on-hand and stock
available quantities for a given product/warehouse. Do it all via a
view or views, as appropriate. Have a column in the table
transaction_type CHAR(1) NOT NULL so you can put a CASE statement
across it so as to know whether to add or subtract.
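(That CASE over transaction_type would be along these lines; the table
and code values are just for illustration:)

SELECT customer_id,
       SUM(CASE transaction_type
               WHEN 'P' THEN -amount  -- payments / credit notes subtract
               ELSE           amount  -- invoices etc. add
           END) AS outstanding_balance
FROM customer_transaction
GROUP BY customer_id;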
In your test case example, if you allow alterations to the values in
the columns rate, quantity or amount, you need triggers in order to
maintain database integrity. Then if somebody with the appropriate
privileges does "ALTER TABLE blah DISABLE ALL TRIGGERS;" your database
integrity goes out the window.
All of the above is completely off your point. I just like to keep it
as straight forward as possible.
My apologies for having a bit of a rant.
Regards,
Rob