Best way to handle multi-billion row read-only table?
Hello.
I'm putting together a database to store the readings from various
measurement devices for later processing. Since these things (water
pressure monitors attached to very large water pipes) take readings at
200Hz and are typically deployed over multiple sites for several months
at a time, I've got many billions of rows of data, each (at the moment)
with the following simple format:
value REAL NOT NULL,
sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
channel INTEGER REFERENCES channel(id) NOT NULL
(Where the "channel" table contains metadata to identify the particular
sensor, data logger, etc. used to obtain the data and the combination of
channel and sample_time is unique.)
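For concreteness, the full definitions might look something like this
(the channel table's columns beyond id are illustrative):

```sql
CREATE TABLE channel (
    id          SERIAL PRIMARY KEY,
    description TEXT NOT NULL    -- sensor, data logger, site, etc.
);

CREATE TABLE sample (
    value       REAL NOT NULL,
    sample_time TIMESTAMP WITH TIME ZONE NOT NULL,
    channel     INTEGER REFERENCES channel(id) NOT NULL,
    UNIQUE (channel, sample_time)
);
```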
Once loaded into the database the data will never be deleted or modified
and will typically be accessed over a particular date range for a
particular channel (e.g. "sample_time >= X AND sample_time <= Y AND
channel=Z"). A typical query won't return more than a few million rows
and speed is not desperately important (as long as the time is measured
in minutes rather than hours).
Are there any recommended ways to organise this? Should I partition my
big table into multiple smaller ones which will always fit in memory
(this would result in several hundreds or thousands of sub-tables)? Are
there any ways to keep the index size to a minimum? At the moment I have
a few weeks of data, about 180GB, loaded into a single table and indexed
on sample_time and channel and the index takes up 180GB too.
Since this is all for a typically budget-restricted PhD project the
hardware is just a high-end desktop workstation with (at the moment)
2*2TB drives organised into a single 4TB partition using FreeBSD's vinum
system.
Many thanks for any help,
Asher.
On 2/9/2010 12:47 PM, Asher wrote:
Well, first: does 200Hz mean 200 samples per channel per second? That
is very fast sampling for a pressure sensor; I would be surprised if
the meters are actually giving real results at that rate. I would look
at reducing that down to what the meter is actually capable of sending.
What kind of A/D card is being used? That affects what makes sense to
record.
I would look into table partitioning
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
http://wiki.postgresql.org/wiki/Table_partitioning
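From those docs, a minimal sketch of the (8.4-era) inheritance-based
scheme might look like this -- table names and date ranges are just
examples, not a finished design:

```sql
-- Parent table holds no rows itself; children carry CHECK constraints
CREATE TABLE sample (
    value       REAL NOT NULL,
    sample_time TIMESTAMPTZ NOT NULL,
    channel     INTEGER NOT NULL
);

-- One child per month, with a non-overlapping time range
CREATE TABLE sample_2010_02 (
    CHECK (sample_time >= '2010-02-01' AND sample_time < '2010-03-01')
) INHERITS (sample);

CREATE INDEX sample_2010_02_time_idx ON sample_2010_02 (sample_time);

-- With constraint_exclusion on, the planner skips children whose
-- CHECK range cannot match the query's WHERE clause
SET constraint_exclusion = on;
```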
One big index over such small records will not be a big win, because
the index is going to be the same size as the table.
Look into limiting the number of records each index covers:
http://www.postgresql.org/docs/8.4/static/sql-createindex.html
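For example, a partial index that covers only one slice of the data
(table name and range purely illustrative):

```sql
-- Only rows from February are indexed; queries whose WHERE clause
-- implies this range can still use the index
CREATE INDEX sample_feb2010_idx ON sample (sample_time)
    WHERE sample_time >= '2010-02-01' AND sample_time < '2010-03-01';
```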
I've only gone up to about a billion rows, but table partitioning seems
to be the way to go to me. I did per-day partitioning, and just had the
job that inserts the daily data add the table automatically. With the
partitioning, it only has to pull up the tables for the specific days,
and is therefore a kind of date index already.
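A rough sketch of that kind of automatic per-day child creation (the
function name and PL/pgSQL details are mine, not the actual job):

```sql
-- Create the child table for a given day, if it doesn't already exist
CREATE OR REPLACE FUNCTION make_day_partition(day DATE) RETURNS void AS $$
BEGIN
    EXECUTE 'CREATE TABLE sample_' || to_char(day, 'YYYYMMDD')
         || ' (CHECK (sample_time >= DATE ' || quote_literal(day::text)
         || '  AND sample_time < DATE ' || quote_literal(day::text) || ' + 1))'
         || ' INHERITS (sample)';
EXCEPTION WHEN duplicate_table THEN
    NULL;  -- partition already exists, nothing to do
END;
$$ LANGUAGE plpgsql;
```

The daily insert job can then call make_day_partition(current_date)
before loading.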
Alex Thurlow
Blastro Networks
http://www.blastro.com
http://www.roxwel.com
http://www.yallwire.com
Justin Graf wrote:
Well, first: does 200Hz mean 200 samples per channel per second? That
is very fast sampling for a pressure sensor; I would be surprised if
the meters are actually giving real results at that rate. I would look
at reducing that down to what the meter is actually capable of sending.
What kind of A/D card is being used? That affects what makes sense to
record.
Yes, we really are measuring at 200 samples per second. We're trying to
capture high resolution images of pressure transients as they move along
water distribution pipelines (the backbones of the water network,
typically 4'-6' in diameter, carrying 500-1000 litres/second) to
understand how they travel and what stress they put upon the pipe. We're
using custom data loggers at the moment based around Intel iMote2 Linux
systems with a high-speed QuickFilter ADC (and the sensors we're using
can cope at 200Hz).
I would look into table partitioning
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
http://wiki.postgresql.org/wiki/Table_partitioning
Thanks for that, it looks like partitioning is the way to go. I'm
assuming that I should try and keep my total_relation_sizes less than
the memory size of the machine?
One big index over such small records will not be a big win, because
the index is going to be the same size as the table.
Look into limiting the number of records each index covers:
http://www.postgresql.org/docs/8.4/static/sql-createindex.html
If I partition so that each partition holds data for a single channel
(and set a CHECK constraint for this) then I can presumably remove the
channel from the index since constraint exclusion will mean that only
partitions holding the channel I'm interested in will be searched in a
query. Given that within a partition all of my sample_time's will be
different do you know if there's a more efficient way to index these?
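A sketch of the scheme I have in mind, with channel 42 as an example:

```sql
-- Child table holding only channel 42's data; the CHECK lets
-- constraint exclusion skip every other channel's partition
CREATE TABLE sample_ch42 (
    CHECK (channel = 42)
) INHERITS (sample);

-- channel can be left out of the index: within this partition every
-- row has channel = 42, so sample_time alone is selective
CREATE INDEX sample_ch42_time_idx ON sample_ch42 (sample_time);
```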
Many thanks,
Asher
On Tue, Feb 9, 2010 at 10:47 AM, Asher <asher@piceur.co.uk> wrote:
Are there any recommended ways to organise this? Should I partition my big
table into multiple smaller ones which will always fit in memory (this would
result in several hundreds or thousands of sub-tables)?
Partitioning is the standard way. Note that you can partition on 1
axis, or 2, or more. In this case partitioning on time and channel
might make the most sense.
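A two-axis child would just carry CHECK constraints on both columns
(names and ranges illustrative):

```sql
-- One child per (month, channel) pair
CREATE TABLE sample_2010_02_ch42 (
    CHECK (sample_time >= '2010-02-01' AND sample_time < '2010-03-01'),
    CHECK (channel = 42)
) INHERITS (sample);
```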
Are there any ways
to keep the index size to a minimum? At the moment I have a few weeks of
data, about 180GB, loaded into a single table and indexed on sample_time and
channel and the index takes up 180GB too.
It may be that with small enough partitions indexes aren't really
needed. That's been the case for a lot of data I've worked with in
the past.
Since this is all for a typically budget-restricted PhD project the hardware
is just a high-end desktop workstation with (at the moment) 2*2TB drives
organised into a single 4TB partition using FreeBSD's vinum system.
Partitioning should definitely help. You might want to go with RAID-1
instead of RAID-0 since the read performance is similar under most
modern OSes. I know linux now aggregates the two drives together to
read, I'd assume BSD does too. That way you've got better reliability
and about the same performance. Load times will be about double, but
that's a one time thing, right?
Asher Hoskins wrote:
If I partition so that each partition holds data for a single channel
(and set a CHECK constraint for this) then I can presumably remove the
channel from the index since constraint exclusion will mean that only
partitions holding the channel I'm interested in will be searched in a
query. Given that within a partition all of my sample_time's will be
different do you know if there's a more efficient way to index these?
How do you plan on accessing this monster data? Do you expect to be
looking up single values or small sets of values at a specific time?
Seems to me like this is the sort of data that's more often processed in
the aggregate, like running a Fourier analysis of sliding windows, and
that sort of data processing may well be more efficiently done with
fixed-block binary files rather than relational databases, as there are
no real relationships in this data.
For instance, a directory for each sensor, with a directory for each
week, and a file for each hour, containing the hour's worth of samples in
fixed binary blocks after a file header identifying it. You can randomly
access a specific time sample by using fseek:
(sample_time - start_time_of_block) * block_size + header_size, or whatever.
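A sketch of that offset arithmetic in Python -- the 16-byte header, 200 Hz
rate, and 4-byte little-endian float records are my assumptions, not an
actual file format:

```python
import struct

HEADER_SIZE = 16   # assumed fixed header at the start of each hour file
RECORD_SIZE = 4    # one little-endian float32 sample
RATE_HZ = 200      # samples per second

def sample_offset(sample_time, start_time_of_block):
    """Byte offset of the sample taken at sample_time (seconds)."""
    n = int(round((sample_time - start_time_of_block) * RATE_HZ))
    return HEADER_SIZE + n * RECORD_SIZE

def read_sample(f, sample_time, start_time_of_block):
    """Seek to and decode a single sample from an open binary file."""
    f.seek(sample_offset(sample_time, start_time_of_block))
    (value,) = struct.unpack('<f', f.read(RECORD_SIZE))
    return value
```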
On 2/9/2010 4:41 PM, Asher Hoskins wrote:
Thanks for that, it looks like partitioning is the way to go. I'm
assuming that I should try and keep my total_relation_sizes less than
the memory size of the machine?
This depends on what the queries look like. As others have stated, when
partitioning you have to consider how the data is queried.
If I partition so that each partition holds data for a single channel
(and set a CHECK constraint for this) then I can presumably remove the
channel from the index since constraint exclusion will mean that only
partitions holding the channel I'm interested in will be searched in a
query. Given that within a partition all of my sample_time's will be
different do you know if there's a more efficient way to index these?
Given that the timestamp will most likely be in the WHERE clause, no. On
the plus side, it's only 8 bytes.
Is it also possible to denormalize by putting the 'channel' data in the
first table (especially if it isn't very much)? Maintaining a foreign key
constraint can impact performance significantly in most RDBMS's, even when
deferring checking. I could be wrong, but I suspect PostgreSQL is no
different. Or keep the data normalized and remove the constraint
altogether. Also remove any primary key constraint so that it doesn't have
to check uniqueness, and avoid as many indexes as you can.
You have to take a leap of faith that you created your program well enough
to not get out of sync.
I would be interested to hear comments on this. These are some of the things
we did on systems I have worked on running Oracle that handled even higher
volumes (tens to hundreds of thousands of transactions per second or
higher... sustained throughout the day at least on the lower volume).
Granted we had real heavy hardware but the DBAs forbade us to create
constraints and indexes etc. for this reason; except on less active tables.
Everyone has already talked about partitioning, but load balancing across
machines, if you can afford a couple more, could help too. Not sure
what facility Postgres has for this though (I would be interested to hear
comments on this too! :-))
BillR
On Tue, Feb 9, 2010 at 8:06 PM, BillR <iambill@williamrosmus.com> wrote:
Is it also possible to denormalize by putting the 'channel' data in the
first table (especially if it isn't very much)? Maintaining a foreign key
constraint can impact performance significantly in most RDBMSs [...]
I was under the impression the data was being gathered elsewhere and
then imported, so the insert performance isn't as critical as if it
were being done in real time.
Asher wrote:
Are there any ways to keep the index size to a minimum? At the moment
I have a few weeks of data, about 180GB, loaded into a single table
and indexed on sample_time and channel and the index takes up 180GB too.
One approach to consider is partitioning by sample_time and not even
including the channel number in the index. You've got tiny records;
there's going to be hundreds of channels of data on each data page
pulled in, right? Why not minimize physical I/O by reducing the index
size, just read that whole section of time in to memory (they should be
pretty closely clustered and therefore mostly sequential I/O), and then
push the filtering by channel onto the CPU instead. If you've got
billions of rows, you're going to end up disk bound anyway; minimizing
physical I/O and random seeking around at the expense of CPU time could
be a big win.
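Concretely, the queries would look something like this (names
illustrative), assuming the partitions are indexed on sample_time alone:

```sql
-- The time range is resolved via the partition's time index;
-- channel = 42 is then applied as a row filter on the pages read,
-- trading CPU time for a much smaller index
SELECT value, sample_time
FROM sample
WHERE sample_time >= '2010-02-09 00:00+00'
  AND sample_time <  '2010-02-10 00:00+00'
  AND channel = 42;
```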
--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com
On Tue, Feb 9, 2010 at 11:51 PM, Greg Smith <greg@2ndquadrant.com> wrote:
One approach to consider is partitioning by sample_time and not even
including the channel number in the index. [...] If you've got billions
of rows, you're going to end up disk bound anyway; minimizing physical
I/O and random seeking around at the expense of CPU time could be a big win.
If they're loaded in a one-time batch, load them in channel order into
the partitions, and the stats should see the perfect ordering and know
to seq scan the right part of the table. I think.
John R Pierce wrote:
How do you plan on accessing this monster data? Do you expect to be
looking up single values or small sets of values at a specific time?
Seems to me like this is the sort of data that's more often processed in
the aggregate, like running a Fourier analysis of sliding windows, and
that sort of data processing may well be more efficiently done with
fixed-block binary files rather than relational databases, as there are
no real relationships in this data.
The data will initially be accessed via a simple GUI which will allow
browsing over a subset of the data (subsampled down to 1
sample/minute/hour, etc. during the data load phase and so massively
smaller datasets) and then once something "interesting" has been found
manually (fully automatic detection of transients has to wait until
we've formally described what we mean by "transient" :-)) the start and
end times can be handed over to our automatic processing code to go
through the full dataset.
I did consider just sticking the data into a series of big dumb files
but by putting them in a DB I can both maintain automatic links between
the full and subsampled data sets and between each data point and the
equipment that measured it and, possibly more importantly, I can provide
a simpler interface to the other people on my project to access the
data. I'm a computer scientist but I'm doing my PhD in the Civil
Engineering dept and all of my colleagues are civil engineers - all
quite happy using Matlab's database plugin but less happy writing
traditional code to crunch through raw files. I'm aware that I'm taking
a, possibly quite large, performance hit by using a database but I'm
hoping that the advantages will outweigh this.
Many thanks for all the replies to my query. I'm going to go with a
partitioned table design and start uploading some data. I'll post how it
performs once I've got some real size data in it.
Asher.
On Wed, Feb 10, 2010 at 2:32 PM, Asher <asher@piceur.co.uk> wrote:
The data will initially be accessed via a simple GUI which will allow
browsing over a subset of the data (subsampled down to 1 sample/minute/hour,
etc.
It sounds like you could use a tool like rrd that keeps various levels
of aggregation and intelligently chooses the right level for the given
query. I think there are such tools though I'm not sure there are any
free ones.
--
greg
2010/2/10 Greg Stark <gsstark@mit.edu>:
It sounds like you could use a tool like rrd that keeps various levels
of aggregation and intelligently chooses the right level for the given
query.
Use as much memory as possible to fit indexes as well as portions of
the table space itself in RAM.
Of course, poor indexing can kill any effort.
--
Vincenzo Romano
NotOrAnd Information Technologies
NON QVIETIS MARIBVS NAVTA PERITVS