Indexing large table of coordinates with GiST

Started by Daniel Begin · January 2015 · 12 messages · pgsql-general
#1 Daniel Begin
jfd553@hotmail.com

Hi, I'm trying to create an index on coordinates (geography type) over a
large table (4.5 billion records) using GiST...

CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);

The command ran for 5 days until my computer stopped because of a power outage!
Before restarting the index creation, I am asking the community if there are
ways to shorten the time it took the first time :-)

Any idea?

Daniel


#2 Vick Khera
vivek@khera.org
In reply to: Daniel Begin (#1)
Re: Indexing large table of coordinates with GiST

I'd restructure the table to be split into perhaps 100 or so inherited
tables (or more). That many rows in a single table are usually not efficient
with Postgres, in my experience. My target is to keep tables under about 100
million rows. I slice them up based on the common query patterns, usually
by some ID number modulo 100. I don't really ever use date ranges like most
tutorials suggest.
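A minimal sketch of that layout on 9.3-era inheritance (column names are assumptions; note that constraint exclusion only kicks in when the query repeats the modulus expression):

CREATE TABLE nodes_p00 (CHECK (node_id % 100 = 0)) INHERITS (nodes);
CREATE TABLE nodes_p01 (CHECK (node_id % 100 = 1)) INHERITS (nodes);
-- ... one child per modulus, each getting its own, much smaller, GiST index:
CREATE INDEX nodes_p00_geom_idx ON nodes_p00 USING gist (geom);
CREATE INDEX nodes_p01_geom_idx ON nodes_p01 USING gist (geom);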


#3 Andy Colson
andy@squeakycode.net
In reply to: Daniel Begin (#1)
Re: Indexing large table of coordinates with GiST


Set maintenance_work_mem as large as you can.
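For example, something along these lines before kicking off the build (the 4GB figure is illustrative; size it to your available RAM):

SET maintenance_work_mem = '4GB';  -- session-local; CREATE INDEX can use this much memory
CREATE INDEX nodes_geom_idx ON nodes USING gist (geom);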

-Andy


#4 Rob Sargent
robjsargent@gmail.com
In reply to: Daniel Begin (#1)
Re: Indexing large table of coordinates with GiST


Daniel, could you please describe the server hardware (CPU and storage)
you're using for this data? I have a similar number of records and
would like to know what it takes to handle such a load.

TIA

#5 Rémi Cura
remi.cura@gmail.com
In reply to: Andy Colson (#3)
Re: Indexing large table of coordinates with GiST

Hey,
You may want to post this on the PostGIS list.

I take it that so many rows mean either raster or point cloud data.
If it is a point cloud, simply consider using pg_pointcloud.
A 6-billion-point cloud is about 600k rows for one of my data sets.

If it is raster, you may consider using the PostGIS raster type.
If you really want to keep that much geometry,
you may want to partition your data on a regular grid.
Cheers,
Rémi-C


#6 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Rémi Cura (#5)
Re: Indexing large table of coordinates with GiST

As Rémi notes, going with a pointcloud approach might be wiser, particularly if you aren’t storing much more about the points than coordinates and other lidar return information. Since you’re only working with points, depending on your spatial distribution (over the poles? across the dateline?), you might just geohash them and index the hashes with a btree instead. Efficiency-wise, such an index will work better than an rtree for points; however, you’ll still have a multi-billion record table, which could cause other slowdowns, depending on your plans for accessing this data once you’ve indexed it.
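A minimal sketch of the geohash-plus-btree idea (the precision of 9 and the cast are assumptions; ST_GeoHash expects a lon/lat geometry, and prefix scans with LIKE need text_pattern_ops unless the database uses the C collation):

CREATE INDEX nodes_geohash_idx
    ON nodes (ST_GeoHash(geom::geometry, 9) text_pattern_ops);

-- A prefix scan then approximates a small bounding-box lookup:
SELECT * FROM nodes
WHERE ST_GeoHash(geom::geometry, 9) LIKE 'c2b2q%';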

P.

-- 
Paul Ramsey
http://cleverelephant.ca
http://postgis.net


#7 Daniel Begin
jfd553@hotmail.com
In reply to: Paul Ramsey (#6)
Re: Indexing large table of coordinates with GiST

Thanks, there are a lot of potential ways to resolve this problem!

For Rob, here is a bit of context concerning my IT environment…

Windows 7 64-bit desktop, running an Intel i7 with 16 GB of RAM. The PostgreSQL 9.3 database is stored on a 3 TB external drive (USB 3 connection with write cache enabled and battery backup), and temp_tablespaces points to a 1 TB internal drive.

Now, let me answer/question the given proposals in the order I received them…

1- Andy, I will set maintenance_work_mem as large as I can unless someone points out an important caveat.

2- Vick, partitioning the table could have been very interesting. However, I will have to query the table using both the node ID (which could have provided a nice partition criterion) and/or the node location (find nodes within a polygon). I am not familiar with table partitioning, but I suspect I can’t create a spatial index on a table that has been partitioned (split into multiple tables that inherit from the “master” table). Am I right?

3- Rémi, so many rows do not necessarily mean either raster or point cloud data (but it was worth asking :-). As I mentioned previously, I must be able to query the table not only using node locations (coordinates) but also using the few other fields the table contains (mainly node IDs). So I don’t think it could work, unless you tell me otherwise?

4- Paul, the node distribution is all over the world but mainly over inhabited areas. However, if I had to define a limit of some sort, I would use the dateline. Concerning spatial queries, I will want to find nodes that are within the boundary of irregular polygons (stored in another table). Is querying on irregular polygons compatible with geohashing?

Regards,

Daniel


#8 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Daniel Begin (#7)
Re: Indexing large table of coordinates with GiST

On January 15, 2015 at 12:36:29 PM, Daniel Begin (jfd553@hotmail.com) wrote:

Paul, the node distribution is all over the world but mainly over inhabited areas. However, if I had to define a limit of some sort, I would use the dateline. Concerning spatial queries, I will want to find nodes that are within the boundary of irregular polygons (stored in another table). Is querying on irregular polygons compatible with geohashing?

Well… yes you can, although the relative efficiency compared to an r-tree will depend a bit on how the query polygons interact with the geohash split points. Also, if you’re planning to slam pretty large polygons through this process, expect it to be kind of slow. You’ll want to do some sharding to spread the problem out over multiple nodes.
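A sketch of how that interaction plays out (table and column names are assumptions; ST_GeoHash of a polygon returns only as many characters as its bounding box pins down, so a polygon straddling a split point yields a short or empty prefix and the scan degrades toward the whole table):

-- Step 1: the polygon's shared geohash prefix (short or empty if it straddles a split point)
SELECT ST_GeoHash(shape::geometry) FROM query_polygons WHERE id = 42;

-- Step 2: prefix scan on the btree index, then the exact test on the survivors
SELECT *
FROM nodes
WHERE ST_GeoHash(geom::geometry, 9) LIKE 'c2b2q%'   -- prefix from step 1
  AND ST_Intersects(geom, (SELECT shape FROM query_polygons WHERE id = 42));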
 

-- 
Paul Ramsey
http://cleverelephant.ca 
http://postgis.net


#9 Nathan Clayton
nathanclayton@gmail.com
In reply to: Daniel Begin (#7)
Re: Indexing large table of coordinates with GiST


Provided you have an integer primary key on both your node and polygon
tables, would it make sense to preprocess the overlaps and have
a many-to-many table with the node-id and polygon-id? Depending on the
speed at which data is ingested, you could easily build triggers to run
after inserts/updates to keep the table updated, or you could create a
globally unique autoincrement field that tracks revisions and update
everything after a given high-water mark.

Lookups and joins would then use integers and should give you much
better performance than searching through the polygons.

For the many-to-many table, something like the following (you can
obviously break the insert into smaller batches if you need to so you
don't blow up your memory usage; if needed, you can keep two copies
partitioned on either the node-id or the polygon-id to speed up lookups,
as this table has the potential to carry many times the records of
either table - the worst case is a Cartesian product, if all nodes fall
within all polygons):

create table node_polygon (
    node_id    bigint not null,
    polygon_id bigint not null,
    primary key (node_id, polygon_id)
);

insert into node_polygon (node_id, polygon_id)
select
    node.node_id,
    polygon.polygon_id
from
    node
    inner join polygon
        on node.point <@ polygon.shape;

create index ix_node_polygon_polygon on node_polygon (polygon_id);
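A sketch of the trigger half of that suggestion (function and trigger names are assumptions; shown for polygons arriving after the initial load):

create or replace function polygon_after_insert() returns trigger as $$
begin
    -- link every node falling inside the newly inserted polygon
    insert into node_polygon (node_id, polygon_id)
    select node.node_id, NEW.polygon_id
    from node
    where node.point <@ NEW.shape;
    return null;  -- AFTER trigger; the return value is ignored
end;
$$ language plpgsql;

create trigger polygon_ai
    after insert on polygon
    for each row execute procedure polygon_after_insert();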


#10 Rémi Cura
remi.cura@gmail.com
In reply to: Nathan Clayton (#9)
Re: Indexing large table of coordinates with GiST

Please let me make one more guess ^^
Third guess: you are using topology (nodes are indexed by node_id).
- If this is the case, you could use PostGIS topology.
- The gain is that with this topology model, you store *shared linestrings*,
not shared points.

More seriously, from what you say it seems possible to use pg_pointcloud
with your data, if the following assumption is correct:
*when querying by other attributes, the points you get are roughly in the
same area (or at least that area is a significant subset of the total area).*
To be perfectly clear: if for a node with node_id *N* you can expect
that the node with node_id *N+1* is spatially close to node N, you can
use pg_pointcloud and it will be effective.

Then the solution could be: partition your points spatially (i.e., from
your billions of points, create a few million groups of points, with a
grid, clustering, whatever).
Then create an index on each group's bounding box,
a GiST index on range(node_id) for each group of points,
and indexes on range(attribute) for the other attributes.
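A rough, generic sketch of that layout (the table and column names are assumptions; pg_pointcloud would store the packed points in its own pcpatch type rather than a bytea):

CREATE TABLE point_groups (
    group_id bigserial PRIMARY KEY,
    bbox     geometry(Polygon, 4326),  -- bounding box of the group
    id_range int8range,                -- min/max node_id within the group
    points   bytea                     -- packed points (a pcpatch in pg_pointcloud)
);
CREATE INDEX ON point_groups USING gist (bbox);
CREATE INDEX ON point_groups USING gist (id_range);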

Then you can query your data effectively, and the index size will fit into
RAM (about 1 GB for 8 million patches for me).
The query would be:
- first, get the groups of points of potential interest
(WHERE ST_Intersects(group_of_points.bbox, your_polygon) AND
group_of_points.id_range && int8range(123, 678) AND other attribute
filtering)
- second, from the selected groups, extract the actual points
and do the fine filtering you need
(WHERE ST_Intersects(ST_MakePoint(point.x, point.y, point.z), your_polygon)
AND node_id BETWEEN 123 AND 678 ...)

If the assumption is correct, it works well. For instance, the billions of
points I use also have a timestamp (equivalent to your node_id, I would
say); I frequently query on time ranges and it is as fast as a spatial
query (on the order of milliseconds).

To give you an order of magnitude of the work involved, it would take me a
couple of hours to put your data into pg_pointcloud (computing time would
be about 12 hours multi-processed, absolutely all inclusive).

Cheers,
Rémi-C


#11 Daniel Begin
jfd553@hotmail.com
In reply to: Daniel Begin (#1)
Re: Indexing large table of coordinates with GiST

Nathan,
I have to verify a few things first, but it might be possible to proceed as you suggest. I will also dig a bit into Paul's suggestion on geohashing. I will get back to you once it's in place.

Thanks all
Daniel


#12 Daniel Begin
jfd553@hotmail.com
In reply to: Daniel Begin (#1)
Re: Indexing large table of coordinates with GiST

Nathan, and all others,

I already have links between each node and each polygon for an initial state. Creating a many-to-many table (nodes_polygons) and indexing it should be easy at this point. However, while the nodes are expected to remain static, new irregular polygons will be added, and I won't be able to preprocess the nodes against them.

Instead of querying the nodes table with these new polygons, I might rather query the polygons table (which is much smaller and already has a GiST index) to find polygons intersecting a new one. From there, I can easily get the related subset of nodes, which will be many orders of magnitude smaller - in which case spatial indexing shouldn't be necessary.
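A sketch of that two-step lookup (table and column names, and the :new_polygon placeholder, are assumptions):

-- Step 1: hit the small, GiST-indexed polygons table with the new polygon,
-- then hop through the many-to-many table to the candidate nodes.
SELECT DISTINCT np.node_id
FROM polygons p
JOIN nodes_polygons np ON np.polygon_id = p.polygon_id
WHERE ST_Intersects(p.geom, :new_polygon);

-- Step 2: fine-filter the (much smaller) candidate set against the new polygon.
SELECT n.*
FROM nodes n
WHERE n.node_id IN (/* candidates from step 1 */)
  AND ST_Intersects(n.geom, :new_polygon);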

Once that is done for a new polygon, I can then update the polygons and nodes_polygons tables.

Am I missing something?

Daniel
