Large data and slow queries

Started by Samuel Williams · almost 9 years ago · 27 messages · general
#1Samuel Williams
space.ship.traveller@gmail.com

Hi.

We have 400,000,000 records in a table (soon to be 800,000,000), here
is the schema (\d+)

https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121

We want the following kinds of query to be fast:

SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
"user_event"."what" IN ('poll', 'location_change',
'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
01:23:55') AND (latitude > -37.03079375089291 AND latitude <
-36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
175.0805140220076);

We have a btree index and it appears to be working. However, it's
still pretty slow.

EXPLAIN ANALYZE gives the following:
https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

I'm thinking that I need to do the following to help:

CLUSTER user_event ON index_user_event_for_visits_3 followed by
analyze... Our data is mostly time series but sometimes we get some
dumps with historical records.
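
For example (assuming the index name from the schema gist):

CLUSTER user_event USING index_user_event_for_visits_3;
ANALYZE user_event;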

Perhaps add a BRIN index on created_at
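
For example (a sketch; the index name is illustrative):

CREATE INDEX index_user_event_created_at_brin
ON user_event USING brin (created_at);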

I'm wondering if we can use an index to cache all user_ids seen on a
given day. If we change our index to be more discrete, e.g.
created_at::date, would this help? The set union of user_ids for 365
days should be pretty fast?

I'm open to any ideas or suggestions, ideally we can keep
optimisations within the database, rather than adding a layer of
caching on top.

Kind regards,
Samuel


#2Samuel Williams
space.ship.traveller@gmail.com
In reply to: Samuel Williams (#1)
Re: Large data and slow queries

Oh, I've also tried earth_distance and ll_to_earth in a GiST index...
it was slower than the btree index on a small subset of data in my
tests.


#3John R Pierce
pierce@hogranch.com
In reply to: Samuel Williams (#1)
Re: Large data and slow queries

On 4/18/2017 9:01 PM, Samuel Williams wrote:

We want the following kinds of query to be fast:

SELECT ... AND (latitude > -37.03079375089291 AND latitude <
-36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
175.0805140220076);

I wonder if GIST would work better if you use the native POINT type, and
compared it like

mypoint <@ BOX
'((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709
))'

with a gist index on mypoint...

but, it all hinges on which clauses in your query are most selective;
that's where you want an index.
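
Roughly, as a sketch (the column and index names are illustrative):

ALTER TABLE user_event ADD COLUMN mypoint point;
UPDATE user_event SET mypoint = point(longitude, latitude);
CREATE INDEX idx_user_event_mypoint ON user_event USING gist (mypoint);

SELECT DISTINCT user_id FROM user_event
WHERE mypoint <@ box '((174.6307139779924,-37.03079375089291),(175.0805140220076,-36.67086424910709))';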

--
john r pierce, recycling bits in santa cruz


#4Samuel Williams
space.ship.traveller@gmail.com
In reply to: John R Pierce (#3)
Re: Large data and slow queries

Thanks John. Yes, you are absolutely right: you want the index to be
bottom-heavy so you can cull as much as possible at the top. I'm
familiar with that principle from once implementing a brute-force
sudoku solver.

I've been working on this all afternoon. By reducing the longitude,
latitude columns to float4, in my test cases, I found about 50%
improvement in performance. It may also use less space. So part of the
problem was my choice of data type. We've computed that float4 has a
worst case precision of about 1.6m which we are okay with for
analytics data.

Another option we may consider is using a (signed) integer - e.g.
longitude = 180*(v/2^31) and latitude = 90*(v/2^31) - as this has a
uniform error across all points, but it's a bit more cumbersome to
handle. Is there a fixed-point datatype in Postgres which works like
this?
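
One way to sketch this with plain int4 (a common fixed-point scheme;
it scales by 10^7 rather than the 2^31 fraction above, and the column
names are illustrative):

ALTER TABLE user_event
ADD COLUMN ilongitude int4,
ADD COLUMN ilatitude int4;

-- degrees * 1e7 fits in int4 (max ~2.1e9) with roughly 1cm worst-case error
UPDATE user_event SET
ilongitude = round(longitude * 1e7)::int4,
ilatitude = round(latitude * 1e7)::int4;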


#5vinny
vinny@xs4all.nl
In reply to: Samuel Williams (#4)
Re: Large data and slow queries


Did that 50% performance gain come from just the datatype, or the fact
that the index became smaller?

Given the number of records, my first thought was either partitioning
or partial indexes. The fewer rows are in the index, the quicker it
will be to check, and it's not a lot of work to create separate
indexes for lat/long ranges or dates.
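
For example (a sketch; the actual predicate would follow whatever
ranges you query most):

CREATE INDEX idx_user_event_2016
ON user_event (latitude, longitude, user_id)
WHERE created_at >= '2016-01-01' AND created_at < '2017-01-01';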


#6John R Pierce
pierce@hogranch.com
In reply to: vinny (#5)
Re: Large data and slow queries

On 4/19/2017 12:31 AM, vinny wrote:

Given the number of records, my first thought was either partitioning
or partial-indexes.
The fewer rows are in the index, the quicker it will be to check,
and it's not a lot of work to create separate indexes for lat/long
ranges or dates.

that only works if the planner can figure out which partitions to use in
advance, otherwise it ends up having to scan all the partitions.

--
john r pierce, recycling bits in santa cruz


#7Samuel Williams
space.ship.traveller@gmail.com
In reply to: John R Pierce (#6)
Re: Large data and slow queries

Did that 50% performance gain come from just the datatype, or the fact that the index became smaller?

How would one measure this?


#8Alban Hertroys
haramrae@gmail.com
In reply to: Samuel Williams (#1)
Re: Large data and slow queries


It seems to me that a large part of the problem is that the server has to scan all index entries from that date and within those location bounds, only to find that the distinct user ids in that set are about 114,000 out of the 1.7M rows matching the selection criteria. If it could stop at the first location for each user, it would have to scan less than a tenth of the index entries that it scans now...

How high is the correlation between user id and location? That can probably be leveraged…
Perhaps you'd get better performance if you'd use multiple indices instead of stuffing everything into a single purpose-specific one? I would suggest one on (user_id, latitude, longitude) or (latitude, longitude, user_id) and one on (created_at, user_id), or perhaps (created_at, latitude, longitude). That approach could also reduce the number of indices you have on that table, as well as their sizes, making it all fit into cache a little easier. Then again, additional operations will be required to combine them.
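
For instance (a sketch of the suggested pair; the index names are illustrative):

CREATE INDEX idx_user_event_location_user
ON user_event (latitude, longitude, user_id);
CREATE INDEX idx_user_event_created_user
ON user_event (created_at, user_id);

The planner can then combine these with a bitmap AND where that is cheaper than one composite index.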

For a different approach: it may be possible to enrich your data with something that is easy to index and query, with a high correlation to (latitude, longitude). That could also be used to partition over. Country seems a good candidate here, unless all your data-points are in New Zealand like the above?
Then again, some countries are a lot larger, with a higher population, than others. And populations can be highly concentrated (Delhi, Moscow to name a few).
Another option is to divide the location space up into squares of a fixed size, with a partition for each square. About 80% of those squares are unpopulated though, being at sea.

Created_at is a very good candidate for partitioning too, especially if you don't intend to keep data older than a certain age. Truncating or dropping a partition that you no longer need is quick and easy.

With data-sets this large, I'd think you would want to partition on multiple dimensions, creating a matrix of partitions under a single master table. I don't think PG has a simple way of doing that (yet) though; perhaps it's possible by abusing multiple levels of inheritance, but that sounds like a bad idea.

And of course, create your partitions sufficiently coarse to prevent overburdening the system tables, which would slow down the query planner.

Hopefully there's something useful in my ramblings!

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.


#9Samuel Williams
space.ship.traveller@gmail.com
In reply to: Alban Hertroys (#8)
Re: Large data and slow queries

Thanks Alban, I appreciate your ideas and thoughts.

I'm a little reluctant to go down the partitioning route, as I think
we'll probably end up sharding out horizontally using Citus in the
near future, and naive Postgres partitioning may hamper that effort.

It's interesting you talk about using multiple indexes. In
MySQL/MariaDB and derivatives, I've never seen the query planner
consider using multiple indexes. So, it's possible that Postgres may
use multiple indexes if it saves time? Or do you mean, doing something
like manually joining the data and leveraging the different indexes
explicitly?

The correlation between user_id and location... well, it's somewhat
temporally related.


#10vinny
vinny@xs4all.nl
In reply to: John R Pierce (#6)
Re: Large data and slow queries

On 2017-04-19 09:48, John R Pierce wrote:


that only works if the planner can figure out which partitions to use
in advance, otherwise it ends up having to scan all the partitions.


True, but then again, the proposed queries are quite straightforward,
so I don't expect that to be a problem, really.

Worth a test, if only to see if it helps.


#11Martijn Tonies (Upscene Productions)
In reply to: Samuel Williams (#1)
Re: Large data and slow queries

Samuel, others,

Perhaps I'm missing something, but I'd be interested in the reasoning behind
this.

For column 'what', it seems you have no index on all values, only indices
with specific values for 'what'.

How does this speed up the search? Will PostgreSQL use those indices,
instead of using a generic index on 'what' and optionally other columns?

With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com


#12Samuel Williams
space.ship.traveller@gmail.com
In reply to: Martijn Tonies (Upscene Productions) (#11)
Re: Large data and slow queries

Martijn, that is a good question. It's because we are only concerned
with a subset of events for this index and this particular query. The
query planner can recognise this and use the index correctly. By doing
this, we reduce the size of the index significantly. In the best case,
where we only wanted a few things, the index was reduced from 21GB to
8MB.


#13vinny
vinny@xs4all.nl
In reply to: Martijn Tonies (Upscene Productions) (#11)
Re: Large data and slow queries


That's a "partial index": it only contains records that meet the
requirements of the index definition.

https://www.postgresql.org/docs/9.5/static/indexes-partial.html

Basically, if you create an index on records where name = 'kees', then
when your query contains WHERE name = 'kees', the planner can just use
that index, knowing that the records in it contain no other names,
which saves both index size and the need to filter on name.
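
In SQL (a sketch, with a hypothetical "people" table):

CREATE INDEX idx_people_kees ON people (created_at)
WHERE name = 'kees';

-- the planner can use the partial index here, because the
-- query's WHERE clause implies the index predicate:
SELECT * FROM people
WHERE name = 'kees' AND created_at >= '2017-01-01';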


#14Martijn Tonies (Upscene Productions)
In reply to: vinny (#13)
Re: Large data and slow queries

Hello Vinny, Samuel,


Thanks for the answers. I've seen such indices before, but used, for
example, on an UPPER(...) of all values; I never thought of using them
to filter out specific values in order to make the index smaller.

With regards,

Martijn Tonies
Upscene Productions
http://www.upscene.com

Database Workbench - developer tool for Oracle, MS SQL Server, PostgreSQL,
SQL Anywhere, MySQL, InterBase, NexusDB and Firebird.


#15Stephen Frost
sfrost@snowman.net
In reply to: Samuel Williams (#1)
Re: Large data and slow queries

Greetings,

* Samuel Williams (space.ship.traveller@gmail.com) wrote:

We want the following kinds of query to be fast:

"kinds of query" isn't helpful, you should be reviewing exactly the
queries you care about because statistics and your exact data set and
what the exact query you're running is will all have an impact.

SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
"user_event"."what" IN ('poll', 'location_change',
'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
01:23:55') AND (latitude > -37.03079375089291 AND latitude <
-36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
175.0805140220076);

Ugh. You should really look at and consider PostGIS whenever you're
working with geospatial data.

We have a btree index and it appears to be working. However, it's
still pretty slow.

You're finding some 17M records and then reducing that with DISTINCT to
only 114k. Hard to say if it's faster to just brute-force your way
through that with a HashAgg (as your explain analyze shows), or if a
loose index scan would work better (check out how to do one in PG here:
https://wiki.postgresql.org/wiki/Loose_indexscan).
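
A sketch of that wiki technique applied here, on just user_id (the
other filter clauses are omitted for brevity):

WITH RECURSIVE t AS (
  SELECT min(user_id) AS user_id FROM user_event
  UNION ALL
  SELECT (SELECT min(user_id) FROM user_event WHERE user_id > t.user_id)
  FROM t WHERE t.user_id IS NOT NULL
)
SELECT user_id FROM t WHERE user_id IS NOT NULL;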

EXPLAIN ANALYZE gives the following:
https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314

That explain analyze shows a whole ton of heap fetches. When was the
last time a VACUUM was run on this table, to build the visibility map?
Without the visibility map being current, an Index-Only Scan, as is
happening here, can really suck.

CLUSTER user_event ON index_user_event_for_visits_3 followed by
analyze... Our data is mostly time series but sometimes we get some
dumps with historical records.

Perhaps add a BRIN index on created_at

BRIN would be helpful if you wanted a smaller index. That doesn't seem
to be the case here.

I'm wondering if... we can use an index to cache, all user_ids seen on
a given day. If we change our index to be more discrete, e.g.
created_at::date, would this help? The set union of user_ids for 365
days should be pretty fast?

Materialized views are often useful, particularly when the results are
(relatively) slow moving.
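
For example (a sketch of the user-ids-per-day idea from upthread,
ignoring the location filter):

CREATE MATERIALIZED VIEW daily_user_ids AS
SELECT DISTINCT created_at::date AS day, user_id
FROM user_event
WHERE what IN ('poll', 'location_change', 'midnight_location');

-- refresh periodically, e.g. nightly:
REFRESH MATERIALIZED VIEW daily_user_ids;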

I'm open to any ideas or suggestions, ideally we can keep
optimisations within the database, rather than adding a layer of
caching on top.

If you need to go the materialized view route, I'd definitely recommend
doing that in the database rather than outside it.

Thanks!

Stephen

#16Samuel Williams
space.ship.traveller@gmail.com
In reply to: Stephen Frost (#15)
Re: Large data and slow queries

Thanks for all the suggestions Stephen.

That explain analyze shows a whole ton of heap fetches. When was the
last time a VACUUM was run on this table, to build the visibility map?
Without the visibility map being current, an Index-Only Scan, as is
happening here, can really suck.

I'm using the default 9.6 config; I thought that auto-vacuum was on by default?


#17Stephen Frost
sfrost@snowman.net
In reply to: Samuel Williams (#16)
Re: Large data and slow queries

Greetings,

* Samuel Williams (space.ship.traveller@gmail.com) wrote:

Thanks for all the suggestions Stephen.


I'm using the default 9.6 config; I thought that auto-vacuum was on by default?

Sure, but that doesn't mean the table gets vacuumed. In particular,
actual vacuums are only kicked off when the number of *updated* or
*deleted* tuples passes the autovacuum threshold. If no updates or
deletes are done on the table (in other words, it's essentially an
insert-only or insert-mostly table), the autovacuum daemon never runs
a vacuum on it (though it'll run analyzes).

https://www.postgresql.org/docs/current/static/runtime-config-autovacuum.html#GUC-AUTOVACUUM-VACUUM-THRESHOLD

Check pg_stat_all_tables to see when/if the table has actually been
vacuumed.
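
For example (a sketch):

SELECT relname, last_vacuum, last_autovacuum
FROM pg_stat_all_tables
WHERE relname = 'user_event';

-- and, if it has never been vacuumed:
VACUUM VERBOSE user_event;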

Thanks!

Stephen

#18Samuel Williams
space.ship.traveller@gmail.com
In reply to: Stephen Frost (#17)
Re: Large data and slow queries

Ah right, yeah, it's insert only. So, it's never been vacuumed.


#19Alban Hertroys
haramrae@gmail.com
In reply to: Samuel Williams (#9)
Re: Large data and slow queries

On 19 Apr 2017, at 12:58, Samuel Williams <space.ship.traveller@gmail.com> wrote:

It's interesting you talk about using multiple indexes. In
MySQL/MariaDB and derivatives, I've never seen the query planner
consider using multiple indexes. So, it's possible that Postgres may
use multiple indexes if it saves time? Or do you mean, doing something
like manually joining the data and leveraging the different indexes
explicitly?

PG is capable of doing bitmap heap scans to combine results from multiple indices, among other things.
Whether that will actually improve performance in this case I don't know, but it's worth a try I think.

The correlation between user_id and location... well, it's somewhat
temporally related.

So users are constantly moving around but happen to be at the same locations at regular intervals?

In my experience, people don't usually move around much, so you should certainly be able to pinpoint them mostly to a specific area, right? (Hence my suggestions for a country column or partitioning in squares)


Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.


#20Samuel Williams
space.ship.traveller@gmail.com
In reply to: Alban Hertroys (#19)
Re: Large data and slow queries

Okay, so after changing longitude/latitude to float4, and
re-organizing the table a bit, I got the query down from about 8
minutes to 40 seconds.

The details are in the gist comments:
https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121

Now I just need to get performance another 2 orders of magnitude
better. Can we make the index a bit more coarse-grained: perhaps
convert long/lat to integers, use a Hilbert curve index, or use a
PostGIS index... ideas?


#21Vick Khera
vivek@khera.org
In reply to: Samuel Williams (#20)
#22Andrew Staller
andrew@timescale.com
In reply to: Vick Khera (#21)
#23Scott Marlowe
scott.marlowe@gmail.com
In reply to: Vick Khera (#21)
#24Samuel Williams
space.ship.traveller@gmail.com
In reply to: Andrew Staller (#22)
#25Samuel Williams
space.ship.traveller@gmail.com
In reply to: Scott Marlowe (#23)
#26Andrew Staller
andrew@timescale.com
In reply to: Samuel Williams (#24)
#27Samuel Williams
space.ship.traveller@gmail.com
In reply to: Andrew Staller (#26)