Use cases for lateral that do not involve a set returning function
I was reading this post the other day:
http://blog.heapanalytics.com/postgresqls-powerful-new-join-type-lateral/
I suspected some of the claims in the post may not have been accurate. This
one in particular:
"Without lateral joins, we would need to resort to PL/pgSQL to do this
analysis. Or, if our data set were small, we could get away with complex,
inefficient queries."
The sum(1) and order by time limit 1 approach seemed less than ideal to me
and I thought this analysis could be done with normal left joins instead of
lateral left joins. So I came up with a proof of concept:
https://github.com/ajw0100/snippets/tree/master/SQL/lateral
Is my conclusion in the README correct? Does anything beyond
select...from...where force a nested loop? In that case, is lateral really
only useful with set returning functions as the docs suggest? Does anyone
know of any use cases for lateral that do not involve a set returning
function?
Thanks,
AJ
AJ Welch wrote:
http://blog.heapanalytics.com/postgresqls-powerful-new-join-type-lateral/
I suspected some of the claims in the post may not have been accurate. This one in particular:
"Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set
were small, we could get away with complex, inefficient queries."The sum(1) and order by time limit 1 approach seemed less than ideal to me and I thought this analysis
could be done with normal left joins instead of lateral left joins. So I came up with a proof of
concept:https://github.com/ajw0100/snippets/tree/master/SQL/lateral
Is my conclusion in the README correct? Does anything beyond select...from...where force a nested
loop? In that case, is lateral really only useful with set returning functions as the docs suggest?
Does anyone know of any use cases for lateral that do not involve a set returning function?
Only recently I used lateral joins to optimize a query.
This is a sample of how the query looked bfore:
SELECT ...
FROM people p
LEFT JOIN names n
ON (n.people_id = p.people_id
AND current_timestamp > n.validfrom
AND NOT EXISTS (SELECT 1 FROM names n2
WHERE n2.people_id = p.people_id
AND current_timestamp > n2.validfrom
AND n2.validfrom > n.validfrom)
)
WHERE p.id = ...
So basically it is supposed to find the latest valid name for a person.
This required two scans of the "names" table per "person" record.
I rewrote it as
SELECT ...
FROM people p
LEFT JOIN LATERAL (SELECT * FROM names n
WHERE n.people_id = p.people_id
AND current_timestamp > n.validfrom
ORDER BY n.validfrom DESC LIMIT 1) n
ON TRUE
WHERE p.id = ...
With the correct index this touched fewer blocks and worked faster.
Also, though this is of course a matter of taste, it is more readable.
Of course this forces a nested loop, but that is not bad as such.
In my case it was not problem (I tried to hint at that with the WHERE clause).
So yes, I think that LATERAL is useful even without set returning functions.
Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Thanks for the response. Yea, using lateral there definitely reads better
to me than using a correlated subquery. And it makes sense that performance
is ok since you're filtering on a specific person's id (as you hinted at
with `WHERE p.id = ...`) and the nested loop forced by `order by...limit 1`
presumably only loops once.
However, I would consider that more of an OLTP style query where I'm kind
of more interested in OLAP style queries as the referenced post was
building a funnel analysis over an events table. I don't think their
approach will scale. I guess it's kind of a specific question but that post
got me wondering if there are any use cases for lateral outside of SRFs
that do not generate a nested loop AND can not be achieved without lateral?
Basically, something I could use in an OLAP query that I couldn't use prior
to 9.3.
Also, just for the heck of it, I took a look at the explain plans for both
of your queries without `WHERE p.id = ...` to see how they would scale.
Correlated subquery:
QUERY PLAN
Hash Right Join (cost=163943.00..5932603426739.67 rows=7499298 width=24)
Hash Cond: (n.people_id = p.people_id)
Join Filter: (NOT (SubPlan 1))
-> Seq Scan on names n (cost=0.00..320543.41 rows=14998595 width=18)
Filter: (now() > validfrom)
-> Hash (cost=77028.00..77028.00 rows=5000000 width=10)
-> Seq Scan on people p (cost=0.00..77028.00 rows=5000000
width=10)
SubPlan 1
-> Seq Scan on names n2 (cost=0.00..395543.88 rows=1 width=0)
Filter: ((validfrom > n.validfrom) AND (people_id = p.people_id)
AND (now() > validfrom))
Lateral:
Nested Loop Left Join (cost=358043.67..1790218514528.00 rows=5000000
width=24)
-> Seq Scan on people p (cost=0.00..77028.00 rows=5000000 width=10)
-> Limit (cost=358043.67..358043.67 rows=1 width=18)
-> Sort (cost=358043.67..358043.68 rows=4 width=18)
Sort Key: n.validfrom
-> Seq Scan on names n (cost=0.00..358043.65 rows=4
width=18)
Filter: ((people_id = p.people_id) AND (now() >
validfrom))
Granted I haven't set up any indexes, but it looks like the correlated
subquery, after an initial access of the names table before the join,
accesses the names table again for each (person, name) pair after the join
(in the join filter). So it's worse than just 2 scans per person. Indeed,
the lateral subquery seems better because it accesses the person table and
then the names table once for each person. However, I might instead do
something like this to access each table just once:
explain
select *
from people p
left join (
select *, rank() over(partition by people_id
order by validfrom desc) as rank
from names
) n on p.people_id = n.people_id
and n.rank = 1
QUERY PLAN
Hash Right Join (cost=3427870.26..3942220.56 rows=5000000 width=36)
Hash Cond: (n.people_id = p.people_id)
-> Subquery Scan on n (cost=3263927.26..3751430.31 rows=75000 width=26)
Filter: (n.rank = 1)
-> WindowAgg (cost=3263927.26..3563929.14 rows=15000094 width=18)
-> Sort (cost=3263927.26..3301427.49 rows=15000094 width=18)
Sort Key: names.people_id, names.validfrom
-> Seq Scan on names (cost=0.00..245542.94
rows=15000094 width=18)
-> Hash (cost=77028.00..77028.00 rows=5000000 width=10)
-> Seq Scan on people p (cost=0.00..77028.00 rows=5000000
width=10)
Thanks,
AJ
http://chartio.com/
https://www.linkedin.com/in/ajw0100
On Tue, Dec 9, 2014 at 2:24 AM, Albe Laurenz <laurenz.albe@wien.gv.at>
wrote:
Show quoted text
AJ Welch wrote:
http://blog.heapanalytics.com/postgresqls-powerful-new-join-type-lateral/
I suspected some of the claims in the post may not have been accurate.
This one in particular:
"Without lateral joins, we would need to resort to PL/pgSQL to do this
analysis. Or, if our data set
were small, we could get away with complex, inefficient queries."
The sum(1) and order by time limit 1 approach seemed less than ideal to
me and I thought this analysis
could be done with normal left joins instead of lateral left joins. So I
came up with a proof of
concept:
https://github.com/ajw0100/snippets/tree/master/SQL/lateral
Is my conclusion in the README correct? Does anything beyond
select...from...where force a nested
loop? In that case, is lateral really only useful with set returning
functions as the docs suggest?
Does anyone know of any use cases for lateral that do not involve a set
returning function?
Only recently I used lateral joins to optimize a query.
This is a sample of how the query looked bfore:
SELECT ...
FROM people p
LEFT JOIN names n
ON (n.people_id = p.people_id
AND current_timestamp > n.validfrom
AND NOT EXISTS (SELECT 1 FROM names n2
WHERE n2.people_id = p.people_id
AND current_timestamp > n2.validfrom
AND n2.validfrom > n.validfrom)
)
WHERE p.id = ...So basically it is supposed to find the latest valid name for a person.
This required two scans of the "names" table per "person" record.
I rewrote it as
SELECT ...
FROM people p
LEFT JOIN LATERAL (SELECT * FROM names n
WHERE n.people_id = p.people_id
AND current_timestamp > n.validfrom
ORDER BY n.validfrom DESC LIMIT 1) n
ON TRUE
WHERE p.id = ...With the correct index this touched fewer blocks and worked faster.
Also, though this is of course a matter of taste, it is more readable.Of course this forces a nested loop, but that is not bad as such.
In my case it was not problem (I tried to hint at that with the WHERE
clause).So yes, I think that LATERAL is useful even without set returning
functions.Yours,
Laurenz Albe
On Tue, Dec 9, 2014 at 4:24 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
SELECT ...
FROM people p
LEFT JOIN LATERAL (SELECT * FROM names n
WHERE n.people_id = p.people_id
AND current_timestamp > n.validfrom
ORDER BY n.validfrom DESC LIMIT 1) n
ON TRUE
WHERE p.id = ...With the correct index this touched fewer blocks and worked faster.
Also, though this is of course a matter of taste, it is more readable.Of course this forces a nested loop, but that is not bad as such.
In my case it was not problem (I tried to hint at that with the WHERE clause).
I don't know...forcing a nestloop is a dealbreaker for many
distributions of data.
merlin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general