Improve PostGIS performance with 62 million rows?

Started by Israel Brewster over 9 years ago · 15 messages · general
#1 Israel Brewster
israel@ravnalaska.net

I have a database (PostgreSQL 9.6.1) containing 62,702,675 rows of latitude (numeric), longitude (numeric), elevation (integer) data, along with a PostGIS (2.3.0) geography column (location), running on a CentOS 6.8 box with 64GB RAM and a RAID10 SSD data drive. I'm trying to get the maximum elevation along a path, for which purpose I've come up with the following query (for one particular path example):

SELECT elevation FROM data WHERE ST_DWithin(location, ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'), 600) ORDER BY elevation DESC LIMIT 1;

The EXPLAIN ANALYZE output of this particular query (https://explain.depesz.com/s/heZ) shows:

QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.840..22653.842 rows=1 loops=1)
   ->  Sort  (cost=4.83..4.83 rows=1 width=4) (actual time=22653.837..22653.837 rows=1 loops=1)
         Sort Key: elevation DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using location_gix on data  (cost=0.42..4.82 rows=1 width=4) (actual time=15.786..22652.041 rows=11081 loops=1)
               Index Cond: (location && '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography)
               Filter: (('0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, '0102000020E6100000020000002C11A8FE41C062C0DFC2BAF1EE964E40713D0AD7A39863C086C77E164BD25140'::geography, '600'::double precision, true))
               Rows Removed by Filter: 4934534
 Planning time: 0.741 ms
 Execution time: 22653.906 ms
(10 rows)

So it is using the index properly, but still takes a good 22 seconds to run, most of which appears to be in the Index Scan.

Is there any way to improve this, or is this going to be about as good as it gets with the number of rows being dealt with? I was planning to use this for a real-time display - punch in a couple of points, get some information about the route between, including maximum elevation - but with it taking 22 seconds for the longer routes at least, that doesn't make for the best user experience.

It's perhaps worth noting that the example above is most likely a worst case scenario. I would expect the vast majority of routes to be significantly shorter, and I want to say the shorter routes query much faster [testing needed]. That said, the faster the better, even for short routes :-)
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

#2 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Israel Brewster (#1)
Re: Improve PostGIS performance with 62 million rows?

The index filters using bounding boxes. A long, diagonal route will have a large bounding box, relative to the area you actually care about (a narrow strip along the route). Use ST_Segmentize() to add points to your route, ST_DumpPoints() to dump those out as points, and ST_MakeLine() to generate new lines from those points, each line very short. The maximum index effectiveness will come when your line length is close to your buffer width.

P
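Rough back-of-envelope numbers illustrate the mismatch (approximate figures, not taken from the thread): the example route spans about 6.8 degrees of longitude and 10.1 degrees of latitude, so its bounding box dwarfs a 600 m strip along the roughly 1,200 km route.

bounding box ~ (6.8 deg x ~45 km/deg) x (10.1 deg x ~111 km/deg) ~ 306 km x 1,121 km ~ 340,000 km^2
600 m strip  ~ 1,200 km x 1.2 km ~ 1,400 km^2

So the box-only index condition passes a few hundred times more rows than the exact filter keeps, which is consistent with the ~4.9M rows removed versus ~11k rows kept in the plan above.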


#3 Israel Brewster
israel@ravnalaska.net
In reply to: Paul Ramsey (#2)
Re: Improve PostGIS performance with 62 million rows?


Ok, I think I understand the concept. So attempting to follow your advice, I modified the query to be:

SELECT elevation
FROM data
WHERE ST_DWithin(
        location,
        (SELECT ST_MakeLine(geom)::geography as split_line
         FROM (SELECT (ST_DumpPoints(
                         ST_Segmentize(
                           ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
                           600
                         )::geometry
                       )).geom
              ) s1),
        600
      )
ORDER BY elevation DESC limit 1;

It took some fiddling to find a syntax that PostgreSQL would accept, but eventually that's what I came up with. Unfortunately, far from improving performance, it killed it - running the query went from 22 seconds to several minutes (EXPLAIN ANALYZE has yet to return a result). The query execution plan shows, at least partially, why:

QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=17119748.98..17119748.98 rows=1 width=4)
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=17.76..17.77 rows=1 width=32)
           ->  Result  (cost=0.00..5.25 rows=1000 width=32)
   ->  Sort  (cost=17119731.21..17171983.43 rows=20900890 width=4)
         Sort Key: data.elevation DESC
         ->  Seq Scan on data  (cost=0.00..17015226.76 rows=20900890 width=4)
               Filter: st_dwithin(location, $0, '600'::double precision)
(8 rows)

So apparently it is now doing a sequential scan on data rather than using the index. And, of course, sorting 20 million rows is not trivial either. Did I do something wrong with forming the query?

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------


#4 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Israel Brewster (#3)
Re: Improve PostGIS performance with 62 million rows?

Yes, you did. You want a query that spits out a tupleset of geometries (one for each wee segment), and then you can join that set to your main table using ST_DWithin() as the join clause.
So start by ditching the main table and just work on a query that generates a pile of wee segments.
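A minimal sketch of that segments-only step, reusing the example route and the 600 m spacing from the earlier messages (illustrative only):

-- Generate one short two-point segment per consecutive pair of densified vertices.
WITH pts AS (
    SELECT (dp).path[1] AS vert, (dp).geom
    FROM ST_DumpPoints(
           ST_Segmentize(
             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
             600
           )::geometry
         ) AS dp
)
SELECT ST_MakeLine(a.geom, b.geom) AS short_line
FROM pts a
JOIN pts b ON b.vert = a.vert + 1;   -- each output line is at most ~600 m long

Once that returns the expected couple of thousand rows on its own, it can be joined to the big table with ST_DWithin() as the join condition.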


#5 Israel Brewster
israel@ravnalaska.net
In reply to: Paul Ramsey (#4)
Re: Improve PostGIS performance with 62 million rows?


Ahhh, I see you've done this sort of thing before (http://blog.cleverelephant.ca/2015/02/breaking-linestring-into-segments.html) :-)

So following that advice I came up with the following query:

WITH dump AS (
    SELECT ST_DumpPoints(
             ST_Segmentize(
               ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
               600
             )::geometry
           ) as pt
),
pts AS (
    SELECT (pt).geom, (pt).path[1] as vert FROM dump
)
SELECT elevation
FROM data
INNER JOIN (SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) as short_line
            FROM pts a
            INNER JOIN pts b
              ON a.vert=b.vert-1 AND b.vert>1) segments
        ON ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC limit 1;

Which yields the following EXPLAIN ANALYZE (https://explain.depesz.com/s/RsTD <https://explain.depesz.com/s/ukwc>):

QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=11611706.90..11611706.91 rows=1 width=4) (actual time=1171.814..1171.814 rows=1 loops=1)
   CTE dump
     ->  Result  (cost=0.00..5.25 rows=1000 width=32) (actual time=0.024..1.989 rows=1939 loops=1)
   CTE pts
     ->  CTE Scan on dump  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.032..4.071 rows=1939 loops=1)
   ->  Sort  (cost=11611681.65..11611768.65 rows=34800 width=4) (actual time=1171.813..1171.813 rows=1 loops=1)
         Sort Key: data.elevation DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop  (cost=0.55..11611507.65 rows=34800 width=4) (actual time=0.590..1167.615 rows=28408 loops=1)
               ->  Nested Loop  (cost=0.00..8357.50 rows=1665 width=64) (actual time=0.046..663.475 rows=1938 loops=1)
                     Join Filter: (a.vert = (b.vert - 1))
                     Rows Removed by Join Filter: 3755844
                     ->  CTE Scan on pts b  (cost=0.00..22.50 rows=333 width=36) (actual time=0.042..0.433 rows=1938 loops=1)
                           Filter: (vert > 1)
                           Rows Removed by Filter: 1
                     ->  CTE Scan on pts a  (cost=0.00..20.00 rows=1000 width=36) (actual time=0.000..0.149 rows=1939 loops=1938)
               ->  Index Scan using location_gix on data  (cost=0.55..6968.85 rows=1 width=36) (actual time=0.085..0.256 rows=15 loops=1938)
                     Index Cond: (location && _st_expand((st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision))
                     Filter: (((st_makeline(ARRAY[a.geom, b.geom]))::geography && _st_expand(location, '600'::double precision)) AND _st_dwithin(location, (st_makeline(ARRAY[a.geom, b.geom]))::geography, '600'::double precision, true))
                     Rows Removed by Filter: 7
 Planning time: 4.318 ms
 Execution time: 1171.994 ms
(22 rows)

So not bad. Went from 20+ seconds to a little over 1 second. Still noticeable for an end user, but definitely usable - and as mentioned, that's a worst-case scenario query. Thanks!

Of course, if you have any suggestions for further improvement, I'm all ears :-)
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------


#6 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Israel Brewster (#5)
Re: Improve PostGIS performance with 62 million rows?

Varying the segment length upwards might have a salutary effect for a
while, as the efficiency improvement of fewer inner loops battles with the
inefficiency of having more points selected by the index filter. Worth an
experiment.

P


#7 Rémi Cura
remi.cura@gmail.com
In reply to: Paul Ramsey (#6)
Re: Improve PostGIS performance with 62 million rows?

Hey,
1 sec seems really good in this case, and I'm assuming you tuned Postgres so the main index fits into RAM (work_mem and all the other stuff).
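For anyone following along, the kind of settings being alluded to might look roughly like this on a 64 GB machine with SSD storage (illustrative starting points only, not values taken from this thread):

-- Rough starting points; adjust for the actual workload, reload/restart as needed.
ALTER SYSTEM SET shared_buffers = '16GB';          -- requires a server restart
ALTER SYSTEM SET effective_cache_size = '48GB';    -- planner's estimate of total available cache
ALTER SYSTEM SET work_mem = '64MB';                -- per-sort / per-hash memory
ALTER SYSTEM SET maintenance_work_mem = '2GB';     -- index builds, VACUUM
ALTER SYSTEM SET random_page_cost = 1.1;           -- random reads are cheap on SSD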

You could avoid one CTE by merging the two:

WITH pts AS (
    SELECT (pt).geom, (pt).path[1] as vert
    FROM ST_DumpPoints(
           ST_Segmentize(
             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
             600
           )::geometry
         ) as pt
)
SELECT elevation
FROM data
INNER JOIN (SELECT ST_MakeLine(ARRAY[a.geom, b.geom]) as short_line
            FROM pts a
            INNER JOIN pts b
              ON a.vert=b.vert-1 AND b.vert>1) segments
        ON ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC limit 1;

Then you could remove the unnecessary (and potentially explosive, if you have a large number of dumped points) inner join on points:
"FROM pts a
INNER JOIN pts b"

You could simply use a window function to generate the segments, like here
<https://github.com/Remi-C/PPPP_utilities/blob/master/postgis/rc_DumpSegments.sql#L51>.
The idea is to dump the points, order them by path, and then link each point with the previous one (the lag() function).
Assuming you don't want to use the available function, this would be something like:

WITH segments AS (
    SELECT ST_MakeLine(lag((pt).geom, 1, NULL) OVER (ORDER BY (pt).path),
                       (pt).geom) AS short_line
    FROM ST_DumpPoints(
           ST_Segmentize(
             ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
             600
           )::geometry
         ) as pt
)
SELECT elevation
FROM data, segments
WHERE segments.short_line IS NOT NULL  -- the first segment is NULL by design (lag function)
  AND ST_DWithin(location, segments.short_line, 600) = TRUE
ORDER BY elevation DESC
limit 1;

I don't know if you can further improve this query after that, but I'd guess it would reduce your time and be safer with regard to scaling.

If you want to further improve your result, you'll have to reduce the number of rows in your index, that is, partition your table into several tables!

This is not easy to do with current Postgres partitioning methods as far as I know (partitioning is easy; automatic, efficient querying is hard).
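On PostgreSQL 9.6 that would mean inheritance-based partitioning; a hedged sketch of one possible child table, with made-up names and an arbitrary latitude band:

-- Hypothetical child table holding one latitude band of the elevation grid.
CREATE TABLE data_lat_60_65 (
    CHECK (latitude >= 60 AND latitude < 65)
) INHERITS (data);

CREATE INDEX data_lat_60_65_location_gix ON data_lat_60_65 USING GIST (location);

Note that constraint exclusion only prunes children when the query filters on the constrained columns, so the application would likely need to add an explicit latitude range alongside the ST_DWithin() condition for the partitioning to pay off.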

Another way would be to reduce your requirements, and consider that in some cases you may want less detail in the altimetry, which would allow you to use a Level Of Detail approach.
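A sketch of one way that level-of-detail idea could look (the names and the ~0.01 degree cell size are arbitrary choices, not from the thread): keep a much smaller table with the maximum elevation per grid cell and query that for long routes.

-- Hypothetical coarse table: one row per ~0.01 degree cell, keeping the cell's max elevation.
CREATE TABLE data_coarse AS
SELECT cell::geography AS location, max(elevation) AS elevation
FROM (SELECT ST_SnapToGrid(location::geometry, 0.01) AS cell, elevation FROM data) s
GROUP BY cell;

CREATE INDEX data_coarse_gix ON data_coarse USING GIST (location);

Because each cell stores its maximum, results from the coarse table tend to err high rather than low, at the cost of positional precision in the ST_DWithin() test.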

Congrats on the well-explained query/problem anyway!
Cheers,
Rémi-C


#8 Israel Brewster
israel@ravnalaska.net
In reply to: Paul Ramsey (#6)
Re: Improve PostGIS performance with 62 million rows?

Ah, yes indeed. Upping the segment length to 1,000 brings the execution time down to 642 ms, and further upping it to 10,000 brings the execution time down again to 442.104 ms. I'll have to play around with it and see where the minimum is. Would that be likely to vary depending on initial path length?
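One way to run that experiment (a sketch of my own, wrapping the lag()-based form of the query so the segment length is a parameter; the function name and defaults are made up):

-- Hypothetical helper: max elevation within `radius` metres of `route`,
-- with the route densified every `seg_len` metres.
CREATE OR REPLACE FUNCTION max_elevation_along(route text, seg_len float8, radius float8 DEFAULT 600)
RETURNS integer AS $$
    WITH segments AS (
        SELECT ST_MakeLine(lag((pt).geom, 1, NULL) OVER (ORDER BY (pt).path),
                           (pt).geom) AS short_line
        FROM ST_DumpPoints(
               ST_Segmentize(ST_GeographyFromText(route), seg_len)::geometry
             ) AS pt
    )
    SELECT elevation
    FROM data, segments
    WHERE segments.short_line IS NOT NULL
      AND ST_DWithin(location, segments.short_line::geography, radius)
    ORDER BY elevation DESC
    LIMIT 1;
$$ LANGUAGE sql STABLE;

-- e.g. SELECT max_elevation_along('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)', 5000);

Timing that call at a few segment lengths for both long and short routes would show whether the sweet spot moves with path length.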

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

#9Israel Brewster
israel@ravnalaska.net
In reply to: Rémi Cura (#7)
Re: Improve PostGIS performance with 62 million rows?

On Jan 5, 2017, at 1:38 PM, Rémi Cura <remi.cura@gmail.com> wrote:

Hey,
1 sec seems really good in this case, and I'm assuming you've tuned Postgres so the main index fits into RAM (work_mem and all the other stuff).

You could avoid a CTE by merging both CTEs:

WITH pts AS (
SELECT (pt).geom, (pt).path[1] as vert
FROM
ST_DumpPoints(
ST_Segmentize(
ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
600
)::geometry
) as pt
)
SELECT elevation
FROM data
INNER JOIN (SELECT
ST_MakeLine(ARRAY[a.geom, b.geom]) as short_line
FROM pts a
INNER JOIN pts b
ON a.vert=b.vert-1 AND b.vert>1) segments
ON ST_DWithin(location, segments.short_line, 600)
ORDER BY elevation DESC limit 1;

Then you could remove the unnecessary (and potentially explosive, if you have a large number of dumped points) inner join on points:
"FROM pts a
INNER JOIN pts b"

You could simply use a window function to generate the segments instead, like the one here <https://github.com/Remi-C/PPPP_utilities/blob/master/postgis/rc_DumpSegments.sql#L51>.
The idea is to dump the points, order them by path, and then link each point with the previous one (the lag function).
Assuming you don't want to use the available function, this would be something like:

WITH segments AS (
SELECT ST_MakeLine( lag((pt).geom , 1, NULL) OVER (ORDER BY (pt).path)
,(pt).geom) AS short_line
FROM ST_DumpPoints(
ST_Segmentize(
ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
600
)::geometry
) as pt
)
SELECT elevation
FROM data ,segments
WHERE segments.short_line IS NOT NULL --the first segment is null by design (lag function)
AND ST_DWithin(location, segments.short_line, 600) = TRUE
ORDER BY elevation DESC
limit 1;

I don't know if you can improve this query much further after that, but I'd guess it will reduce your time and be more robust with regard to scaling.

If you want to improve your results further, you'll have to reduce the number of rows in your index, that is, partition your table into several tables!

This is not easy to do with current Postgres partitioning methods as far as I know (partitioning is easy; automatic, efficient querying is hard).

Another way would be to reduce your requirements and accept that in some cases you may want less detail in the altimetry, which would allow you to use a Level of Detail approach.

Congrats on the well-explained query/problem anyway!
Cheers,
Rémi-C

Ooooh, nice use of a window function - that change right there cut the execution time in half! I was able to shave off a few hundredths of a second more by tweaking the ST_Segmentize length parameter up to 5,000 (still have to play with that number some), so execution time is now down to the sub-300ms range. If I reduce the radius I'm searching around the line, I can additionally improve the time to around 200 ms, but I'm not sure that will be an option. Regardless, 300ms is rather impressive, I think. Thanks!
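
(For anyone tuning the same knob, a small illustrative sketch of the trade-off being adjusted, not something from the thread: a larger ST_Segmentize length means fewer, longer segments, so fewer index probes per query but a fatter bounding box per probe.)

SELECT seg_len,
       ST_NPoints(
           ST_Segmentize(
               ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
               seg_len
           )::geometry
       ) AS n_vertices  -- roughly route_length / seg_len segments get generated from these
FROM (VALUES (600::float8), (5000::float8)) AS t(seg_len);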

#10Israel Brewster
israel@ravnalaska.net
In reply to: Israel Brewster (#9)
Re: Improve PostGIS performance with 62 million rows?

So just for interest's sake, to kick things up a notch (and out of sheer morbid curiosity), I loaded a higher-resolution dataset (elevation data for the state of Alaska at 2 arc-second resolution, as opposed to the 100 meter resolution before). Same structure/indexes and everything, just higher resolution. So the new database has 1,642,700,002 rows, and is somewhere around 300GB in size (including the index). Due to the larger data size, I moved the database to a different tablespace which resides on a mirrored 2TB spinning-platter disk (i.e. slower both because of the RAID configuration and the lack of SSD). Friday evening I ran the following query:

EXPLAIN ANALYZE WITH segments AS (
SELECT ST_MakeLine( lag((pt).geom , 1, NULL) OVER (ORDER BY (pt).path)
,(pt).geom)::GEOGRAPHY AS short_line
FROM ST_DumpPoints(
ST_Segmentize(
ST_GeographyFromText('SRID=4326;LINESTRING(-150.008056 61.179167,-156.77 71.285833)'),
5000
)::geometry
) as pt
)
SELECT elevation
FROM data ,segments
WHERE segments.short_line IS NOT NULL
AND ST_DWithin(location, segments.short_line, 100) = TRUE
ORDER BY elevation DESC
limit 1;

Which is the same query that took around 300 ms on the smaller dataset. The result was this (https://explain.depesz.com/s/mKFN <https://explain.depesz.com/s/mKFN&gt;):

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=354643835.82..354643835.83 rows=1 width=9) (actual time=225998.319..225998.320 rows=1 loops=1)
CTE segments
-> WindowAgg (cost=60.08..82.58 rows=1000 width=64) (actual time=0.488..4.032 rows=234 loops=1)
-> Sort (cost=60.08..62.58 rows=1000 width=64) (actual time=0.460..0.875 rows=234 loops=1)
Sort Key: pt.path
Sort Method: quicksort Memory: 57kB
-> Function Scan on st_dumppoints pt (cost=0.25..10.25 rows=1000 width=64) (actual time=0.354..0.387 rows=234 loops=1)
-> Sort (cost=354643753.25..354645115.32 rows=544829 width=9) (actual time=225998.319..225998.319 rows=1 loops=1)
Sort Key: data.elevation DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.68..354641029.10 rows=544829 width=9) (actual time=349.784..225883.557 rows=159654 loops=1)
-> CTE Scan on segments (cost=0.00..20.00 rows=995 width=32) (actual time=0.500..4.823 rows=233 loops=1)
Filter: (short_line IS NOT NULL)
Rows Removed by Filter: 1
-> Index Scan using location_gist_idx on data (cost=0.68..356423.07 rows=5 width=41) (actual time=71.416..969.196 rows=685 loops=233)
Index Cond: (location && _st_expand(segments.short_line, '100'::double precision))
Filter: ((segments.short_line && _st_expand(location, '100'::double precision)) AND _st_dwithin(location, segments.short_line, '100'::double precision, true))
Rows Removed by Filter: 8011
Planning time: 4.554 ms
Execution time: 225998.839 ms
(20 rows)

So a little less than four minutes. Not bad (given the size of the database), or so I thought.

This morning (so a couple of days later) I ran the query again without the explain analyze to check the results, and noticed that it didn't take anywhere near four minutes to execute. So I ran the explain analyze again, and got this:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=354643835.82..354643835.83 rows=1 width=9) (actual time=9636.165..9636.166 rows=1 loops=1)
CTE segments
-> WindowAgg (cost=60.08..82.58 rows=1000 width=64) (actual time=0.345..1.137 rows=234 loops=1)
-> Sort (cost=60.08..62.58 rows=1000 width=64) (actual time=0.335..0.428 rows=234 loops=1)
Sort Key: pt.path
Sort Method: quicksort Memory: 57kB
-> Function Scan on st_dumppoints pt (cost=0.25..10.25 rows=1000 width=64) (actual time=0.198..0.230 rows=234 loops=1)
-> Sort (cost=354643753.25..354645115.32 rows=544829 width=9) (actual time=9636.165..9636.165 rows=1 loops=1)
Sort Key: data.elevation DESC
Sort Method: top-N heapsort Memory: 25kB
-> Nested Loop (cost=0.68..354641029.10 rows=544829 width=9) (actual time=1.190..9602.606 rows=159654 loops=1)
-> CTE Scan on segments (cost=0.00..20.00 rows=995 width=32) (actual time=0.361..1.318 rows=233 loops=1)
Filter: (short_line IS NOT NULL)
Rows Removed by Filter: 1
-> Index Scan using location_gist_idx on data (cost=0.68..356423.07 rows=5 width=41) (actual time=0.372..41.126 rows=685 loops=233)
Index Cond: (location && _st_expand(segments.short_line, '100'::double precision))
Filter: ((segments.short_line && _st_expand(location, '100'::double precision)) AND _st_dwithin(location, segments.short_line, '100'::double precision, true))
Rows Removed by Filter: 8011
Planning time: 0.941 ms
Execution time: 9636.285 ms
(20 rows)

So from four minutes on the first run to around 9 1/2 seconds on the second. Presumably this difference is due to caching? I would have expected any caches to have expired by the time I made the second run, but the data *is* static, so I guess not. Otherwise, I don't know how to explain the improvement on the second run - the query plans appear identical (at least to me). *IS* there something else (for example, auto vacuum running over the weekend) that could explain the performance difference?

Assuming this performance difference *is* due to caching, that brings up a couple of questions for me:

1) Is there any way to "force" PostgreSQL to cache the data? Keep in mind that the database is close to a couple of hundred Gigs of data, so there is no way it can all fit in RAM. (One concrete possibility is sketched below, after question 2.)

2) In lieu of forcing a cache (which is probably not going to work well, even if possible), what could I do to help ensure that performance is closer to the 9 second mark than the 4 minute mark in general? For example, would it be likely to make a significant difference if I was to add a couple of larger SSDs to hold this data and put them in a striped RAID (rather than the mirrored 7200 RPM platter drives it is on now)? Since the data is static, losing the data due to drive failure is of little concern to me. Or would adding more RAM (and tweaking PostgreSQL settings) to be able to increase the cache size help more, even though there would still not be enough to cache everything?
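
(Regarding question 1, a hedged aside rather than something from the thread: PostgreSQL ships a contrib extension, pg_prewarm, that can pull a relation into shared_buffers on demand, for example right after a restart. It can only warm as much as the cache can actually hold, so with roughly 300GB of table plus index against 64GB of RAM it mostly helps for the index and the hottest parts of the table. The relation names below are taken from the plans above.)

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('location_gist_idx');  -- warm the GiST index first
SELECT pg_prewarm('data');               -- then as much of the table as still fits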

In the end, the low resolution data is probably good enough, and I may be able to come up with some sort of method to use them both - i.e. return a result quickly from the low resolution dataset, while simultaneously firing off the same request to the high resolution dataset, and returning that result when ready, or only using the high-resolution data set when explicitly requested. So having to wait four minutes on occasion for a result from the high-resolution set may not be an issue. That said, it would be nice to know all the options I can present to my boss :-)

-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

#11Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Israel Brewster (#10)
Re: Improve PostGIS performance with 62 million rows?

At BILLIONS, you're getting to a point where the point index is probably (a) very large and (b) very deep, so you might want to do something different with your data storage, like loading the data in spatially compact patches of several 10s of points. Then the index will float more nicely in memory, and be faster to traverse. Something like pgpointcloud may start to look like it has some advantages.

WRT your time differences, make sure to try the same query but with *different routes*. I find that often a slow query gets fast if I run it twice identically, but if I run it twice with different parameterizations I see slower execution. Basically the second time you're seeing some caching of the immediately important blocks, but not necessarily every block you might need for every case.

P.
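
(A rough sketch of the "spatially compact patches" idea in plain SQL, illustrative only and with made-up object names, for the case where pgpointcloud isn't an option: collapse each small grid cell's points into a single multipoint row so the GiST index has far fewer, fatter entries to keep in memory. A coarse max-along-route query could then hit the patch index first and only unpack the few matching patches.)

CREATE TABLE data_patches AS
SELECT ST_SnapToGrid(location::geometry, 0.05) AS cell,  -- ~0.05 degree grid cell key
       ST_Collect(location::geometry) AS patch,          -- multipoint of every point in the cell
       max(elevation) AS max_elevation                   -- per-cell summary usable for coarse answers
FROM data
GROUP BY 1;

CREATE INDEX data_patches_gix ON data_patches USING gist (patch);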

#12Jonathan Vanasco
postgres@2xlp.com
In reply to: Israel Brewster (#10)
Re: Improve PostGIS performance with 62 million rows?

On Jan 9, 2017, at 12:49 PM, Israel Brewster wrote:

Planning time: 4.554 ms
Execution time: 225998.839 ms
(20 rows)

So a little less than four minutes. Not bad (given the size of the database), or so I thought.

This morning (so a couple of days later) I ran the query again without the explain analyze to check the results, and noticed that it didn't take anywhere near four minutes to execute. So I ran the explain analyze again, and got this:

...

Planning time: 0.941 ms
Execution time: 9636.285 ms
(20 rows)

So from four minutes on the first run to around 9 1/2 seconds on the second. Presumably this difference is due to caching? I would have expected any caches to have expired by the time I made the second run, but the data *is* static, so I guess not. Otherwise, I don't know how to explain the improvement on the second run - the query plans appear identical (at least to me). *IS* there something else (for example, auto vacuum running over the weekend) that could explain the performance difference?

This may sound crazy, but I suggest running each of these scenarios 3+ times:

# cold explain
stop postgres
start postgres
explain analyze SELECT

# cold select
stop postgres
start postgres
enable \t for query timing
SELECT

# cold explain to select
stop postgres
start postgres
explain analyze SELECT
enable \t for query timing
SELECT

# cold select to explain
stop postgres
start postgres
enable \t for query timing
SELECT
explain analyze SELECT

# cold select to select
stop postgres
start postgres
enable \t for query timing
SELECT
SELECT

I've found the timing for "Explain Analyze" to be incredibly different from an actual SELECT on complex/large dataset queries... and the differences don't seem to correlate to possible speedups from index/table caching.
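
(A small, SQL-only complement to the scenarios above, offered as an aside: the contrib extension pg_buffercache lets you see how much of the table and index actually sit in shared_buffers before and after each run, which helps separate genuine caching effects from differences in EXPLAIN ANALYZE's own measurement overhead. The relation names below match the plans earlier in the thread; note this only shows shared_buffers, not the OS page cache.)

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname IN ('data', 'location_gist_idx')
GROUP BY c.relname;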


#13Rémi Cura
remi.cura@gmail.com
In reply to: Jonathan Vanasco (#12)
Re: Improve PostGIS performance with 62 million rows?

Hey,
I like your curiosity!

At the billion-row range, you __have__ to use pgpointcloud, a pyramid raster solution (actually the more common way to perform this task), or another database (hello MonetDB).
Cheers,
Rémi-C


#14Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Israel Brewster (#10)
Re: Improve PostGIS performance with 62 million rows?

On Mon, Jan 9, 2017 at 11:49 AM, Israel Brewster <israel@ravnalaska.net> wrote:

[load of new data]

Limit (cost=354643835.82..354643835.83 rows=1 width=9) (actual time=225998.319..225998.320 rows=1 loops=1)

[...] I ran the query again [...]

Limit (cost=354643835.82..354643835.83 rows=1 width=9) (actual time=9636.165..9636.166 rows=1 loops=1)

So from four minutes on the first run to around 9 1/2 seconds on the second.
Presumably this difference is due to caching?

It is likely to be, at least in part. Did you run VACUUM on the
data before the first run? If not, hint bits may be another part
of it. The first access to each page after the bulk load would
require some extra work for visibility checking and would cause a
page rewrite for the hint bits.
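If that's what happened, one way to pay that cost up front instead of on first read is to run a manual VACUUM right after the bulk load and index build, along these lines:

-- set hint bits and gather planner statistics in one pass
VACUUM (ANALYZE, VERBOSE) data;

-- or, since the data is static, freeze it as well to head off a later anti-wraparound vacuum
VACUUM (FREEZE, ANALYZE) data;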

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#15Israel Brewster
israel@ravnalaska.net
In reply to: Kevin Grittner (#14)
Re: Improve PostGIS performance with 62 million rows?

On Jan 9, 2017, at 1:54 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

On Mon, Jan 9, 2017 at 11:49 AM, Israel Brewster <israel@ravnalaska.net> wrote:

[load of new data]

Limit (cost=354643835.82..354643835.83 rows=1 width=9) (actual time=225998.319..225998.320 rows=1 loops=1)

[...] I ran the query again [...]

Limit (cost=354643835.82..354643835.83 rows=1 width=9) (actual time=9636.165..9636.166 rows=1 loops=1)

So from four minutes on the first run to around 9 1/2 seconds on the second.
Presumably this difference is due to caching?

It is likely to be, at least in part. Did you run VACUUM on the
data before the first run? If not, hint bits may be another part
of it. The first access to each page after the bulk load would
require some extra work for visibility checking and would cause a
page rewrite for the hint bits.

That could be - I had planned to run a VACUUM ANALYZE after creating the indexes, but forgot. By the time I got around to running the second query, autovacuum should have kicked in and done it for me.
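For what it's worth, a quick look at pg_stat_user_tables should confirm (or not) whether autovacuum/autoanalyze actually got to the table in the meantime:

-- NULL in last_autovacuum / last_autoanalyze means it never ran against this table
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'data';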

