advice sought - general approaches to optimizing queries around "event streams"
I have a growing database with millions of rows that track resources against an event stream.
i have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and offer .05s partial scans.
no matter how i restructure queries, I can't seem to get around a few bottlenecks and I wanted to know if there were any tips/tricks from the community on how to approach them.
a simple form of my database would be:
-- 1k of
create table stream (
id int not null primary key,
)
-- 1MM of
create table resource (
id int not null primary key,
col_a bool,
col_b bool,
col_c text,
);
-- 10MM of
create table streamevent (
id int not null,
event_timestamp timestamp not null,
stream_id int not null references stream(id)
);
-- 10MM of
create table resource_2_stream_event(
resource_id int not null references resource(id),
streamevent_id int not null references streamevent(id)
)
Everything is running off of indexes; there are no seq scans.
I've managed to optimize my queries by avoiding joins against tables, and turning the stream interaction into a subquery or CTE.
better performance has come from limiting the number of "stream events" ( which are only the timestamp and resource_id off a joined table )
The bottlenecks I've encountered have primarily been:
1. When interacting with a stream, the ordering of event_timestamp and deduplicating of resources becomes an issue.
I've figured out a novel way to work with the most recent events, but distant events are troublesome
using no limit, the query takes 3500 ms
using a limit of 10000, the query takes 320ms
using a limit of 1000, the query takes 20ms
there is a dedicated index of on event_timestamp (desc) , and it is being used
according to the planner... finding all the records is fine; merging-into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max+group_by)
2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin based search)
I thought about limiting the query by finding matching resources first, then locking it to an event stream, but:
- scanning the entire table for a term takes about 10 seconds on an initial hit. subsequent queries for the same terms end up using the cache, and complete within 20ms.
I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events
i didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than i am with raw SQL right now.
i'm hoping someone can enlighten me into looking at new ways to solve these problems. i think i've learned more about postgres/sql in the past 48hour than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data , rather than optimizing the current queries.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
We have a similar timeseries database approaching 500m records.
We partition the main tables (much like your events) into one year subsets, with a clustered index on timestamp for all but the live year.
https://blog.engineyard.com/2013/scaling-postgresql-performance-table-partitioning
http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html
http://www.postgresql.org/docs/9.3/static/sql-cluster.html
As discussed here previously, you can also improve performance using hardware - SSD'd vs spindles. Also note that tablespaces, with indexes on your faster drives & data on slower ones can improve performance.
http://www.postgresql.org/docs/9.3/static/manage-ag-tablespaces.html
Also make sure your db server is optimised for the database size & hardware configuration - like perhaps alloe fewer concurrent users, but more resources per user, or see what pgtune recommends.
Should help your performance, in terms of underlying db efficiency & performance, rather than tweaking your actual queries.
Brent Wood
Programme leader: Environmental Information Delivery
NIWA
DDI: +64 (4) 3860529
Brent Wood
Principal Technician - GIS and Spatial Data Management
Programme Leader - Environmental Information Delivery
+64-4-386-0529 | 301 Evans Bay Parade, Greta Point, Wellington | www.niwa.co.nz<http://www.niwa.co.nz>
[NIWA]<http://www.niwa.co.nz>
________________________________________
From: pgsql-general-owner@postgresql.org <pgsql-general-owner@postgresql.org> on behalf of Jonathan Vanasco <postgres@2xlp.com>
Sent: Saturday, September 27, 2014 9:02 AM
To: PostgreSQL general
Subject: [GENERAL] advice sought - general approaches to optimizing queries around "event streams"
I have a growing database with millions of rows that track resources against an event stream.
i have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and offer .05s partial scans.
no matter how i restructure queries, I can't seem to get around a few bottlenecks and I wanted to know if there were any tips/tricks from the community on how to approach them.
a simple form of my database would be:
-- 1k of
create table stream (
id int not null primary key,
)
-- 1MM of
create table resource (
id int not null primary key,
col_a bool,
col_b bool,
col_c text,
);
-- 10MM of
create table streamevent (
id int not null,
event_timestamp timestamp not null,
stream_id int not null references stream(id)
);
-- 10MM of
create table resource_2_stream_event(
resource_id int not null references resource(id),
streamevent_id int not null references streamevent(id)
)
Everything is running off of indexes; there are no seq scans.
I've managed to optimize my queries by avoiding joins against tables, and turning the stream interaction into a subquery or CTE.
better performance has come from limiting the number of "stream events" ( which are only the timestamp and resource_id off a joined table )
The bottlenecks I've encountered have primarily been:
1. When interacting with a stream, the ordering of event_timestamp and deduplicating of resources becomes an issue.
I've figured out a novel way to work with the most recent events, but distant events are troublesome
using no limit, the query takes 3500 ms
using a limit of 10000, the query takes 320ms
using a limit of 1000, the query takes 20ms
there is a dedicated index of on event_timestamp (desc) , and it is being used
according to the planner... finding all the records is fine; merging-into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max+group_by)
2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin based search)
I thought about limiting the query by finding matching resources first, then locking it to an event stream, but:
- scanning the entire table for a term takes about 10 seconds on an initial hit. subsequent queries for the same terms end up using the cache, and complete within 20ms.
I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events
i didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than i am with raw SQL right now.
i'm hoping someone can enlighten me into looking at new ways to solve these problems. i think i've learned more about postgres/sql in the past 48hour than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data , rather than optimizing the current queries.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Attachments:
On 27/09/14 09:02, Jonathan Vanasco wrote:
I have a growing database with millions of rows that track resources against an event stream.
i have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans and offer .05s partial scans.
no matter how i restructure queries, I can't seem to get around a few bottlenecks and I wanted to know if there were any tips/tricks from the community on how to approach them.
a simple form of my database would be:
-- 1k of
create table stream (
id int not null primary key,
)-- 1MM of
create table resource (
id int not null primary key,
col_a bool,
col_b bool,
col_c text,
);-- 10MM of
create table streamevent (
id int not null,
event_timestamp timestamp not null,
stream_id int not null references stream(id)
);-- 10MM of
create table resource_2_stream_event(
resource_id int not null references resource(id),
streamevent_id int not null references streamevent(id)
)Everything is running off of indexes; there are no seq scans.
I've managed to optimize my queries by avoiding joins against tables, and turning the stream interaction into a subquery or CTE.
better performance has come from limiting the number of "stream events" ( which are only the timestamp and resource_id off a joined table )The bottlenecks I've encountered have primarily been:
1. When interacting with a stream, the ordering of event_timestamp and deduplicating of resources becomes an issue.
I've figured out a novel way to work with the most recent events, but distant events are troublesomeusing no limit, the query takes 3500 ms
using a limit of 10000, the query takes 320ms
using a limit of 1000, the query takes 20msthere is a dedicated index of on event_timestamp (desc) , and it is being used
according to the planner... finding all the records is fine; merging-into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max+group_by)2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin based search)
I thought about limiting the query by finding matching resources first, then locking it to an event stream, but:
- scanning the entire table for a term takes about 10 seconds on an initial hit. subsequent queries for the same terms end up using the cache, and complete within 20ms.I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events
i didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than i am with raw SQL right now.
i'm hoping someone can enlighten me into looking at new ways to solve these problems. i think i've learned more about postgres/sql in the past 48hour than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data , rather than optimizing the current queries.
Minor point: when specifying PRIMARY KEY, you don't need to also put NOT
NULL (this should make no change to performance).
I notice that the 'id' of 'streamevent' is not marked as a PRIMARY KEY,
so it will not have an index associated with it - hence referencing it
as a foreign key might be slower than expected.
Cheers,
Gavin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general