10 TB database

Started by Artur, almost 17 years ago, 17 messages, general
#1Artur
a_wronski@gazeta.pl

Hi!

We are thinking of creating a stock-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month.
This data would be in two tables, with about 50,000,000 new rows every month.

We want to have access to all the data, mostly for generating user-requested
reports (aggregation).
We would have about 10 TB of data in three years.

Do you think it is possible to build this with PostgreSQL, and do you have any
idea how to start? :)

Thanks in advance,
Artur

#2Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Artur (#1)
Re: 10 TB database

On Mon, Jun 15, 2009 at 1:00 PM, Artur <a_wronski@gazeta.pl> wrote:

Hi!

We are thinking of creating a stock-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month.
This data would be in two tables, with about 50,000,000 new rows every month.

Well, obviously you need to decrease its size by doing some
normalization then.
If some information is the same across the table, move it into a separate
table and assign an id to it.

If you can send me a sample of that data, I could tell you where to cut the
size. I have databases that big under my wing, and that's where
normalization starts to make sense, to save space (and hence speed
things up).
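
In SQL terms, the suggestion is roughly the following (table and column names are made up for illustration):

```sql
-- Instead of repeating the ticker text in every one of the
-- ~50,000,000 monthly rows, store it once and reference it by id:
CREATE TABLE symbols (
    symbol_id serial PRIMARY KEY,
    ticker    text NOT NULL UNIQUE
);

CREATE TABLE ticks (
    symbol_id integer       NOT NULL REFERENCES symbols (symbol_id),
    ts        timestamp     NOT NULL,
    price     numeric(12,4) NOT NULL,
    volume    bigint        NOT NULL
);
```

Replacing a repeated text column with a 4-byte integer saves space on every row, which at this volume adds up quickly.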

We want to have access to all the data, mostly for generating user-requested
reports (aggregation).
We would have about 10 TB of data in three years.

For that sort of database you will need partitioning for sure.

Write to me, I can help privately, maybe for a small tribute ;)

--
GJ

#3Whit Armstrong
armstrong.whit@gmail.com
In reply to: Grzegorz Jaśkiewicz (#2)
Re: 10 TB database

I have a 300GB database, and I would like to look at partitioning as a
possible way to speed it up a bit.

I see the partitioning examples from the documentation:
http://www.postgresql.org/docs/8.3/static/ddl-partitioning.html

Is anyone aware of additional examples or tutorials on partitioning?

Thanks,
Whit


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Artur (#1)
Re: 10 TB database

Artur wrote:

Hi!

We are thinking of creating a stock-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month.
This data would be in two tables, with about 50,000,000 new rows every month.

Sounds a bit like what Truviso does ...

--
Alvaro Herrera

#5Dann Corbit
DCorbit@connx.com
In reply to: Artur (#1)
Re: 10 TB database

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
owner@postgresql.org] On Behalf Of Artur
Sent: Monday, June 15, 2009 5:00 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] 10 TB database

Hi!

We are thinking of creating a stock-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every
month.
This data would be in two tables, with about 50,000,000 new rows every
month.

We want to have access to all the data, mostly for generating user-requested
reports (aggregation).
We would have about 10 TB of data in three years.

Do you think it is possible to build this with PostgreSQL, and do you have
any idea how to start? :)

Consider summarization of this data into a data warehouse.
Most of the data will be historical, and therefore the vast majority of it
will be read-mostly (with the rare write operations probably
consisting mostly of corrections).
You won't want to scan the whole 10 TB every time you make a
summarization query.

I have an idea that might make an interesting experiment:
Create tables that are a combination of year and month.
Create views that combine all 12 months into one yearly table.
Create a view that combines all the yearly views into one global view.
The reason that I think this suggestion may have some merit is that the
historical trends will not need to be recalculated on a daily basis (but
it would be nice if you could perform calculations against the whole
pile at will on rare occasions). Maintaining separate tables by month
will reduce the average depth of the b-trees. I guess that for the most
part, the active calculations will be only against recent data (e.g. the
past 6 months to one year or so). It could also be interesting to create
a view that combines the N most recent months of data, where N is
supplied on the fly (I do not know how difficult it would be to create
this view, or even if it is possible).
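
A sketch of that layout, with hypothetical names and only two months written out; the remaining months extend the same pattern:

```sql
-- One table per year/month:
CREATE TABLE ticks_2009_06 (
    symbol text          NOT NULL,
    ts     timestamp     NOT NULL,
    price  numeric(12,4) NOT NULL
);
CREATE TABLE ticks_2009_07 (LIKE ticks_2009_06 INCLUDING ALL);

-- One view per year, combining its monthly tables:
CREATE VIEW ticks_2009 AS
    SELECT * FROM ticks_2009_06
    UNION ALL
    SELECT * FROM ticks_2009_07;   -- ... and the other ten months

-- One global view over the yearly views
-- (assuming a ticks_2008 view built the same way):
CREATE VIEW ticks_all AS
    SELECT * FROM ticks_2008
    UNION ALL
    SELECT * FROM ticks_2009;
```

For the "N most recent months" case, one option would be a set-returning PL/pgSQL function that builds the UNION ALL over the newest N tables dynamically with EXECUTE, since a plain view cannot take a parameter.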

If you are going to collect a terrific volume of data like this, I
suggest coupling a mathematics package such as SAS, R, Octave, SciLab, or
Maxima with the data, so that you can effectively support decisions
derived from it.

You are also going to need high-end hardware to support a database like
this. These are just some ideas you might like to test when you start
fooling around with this data.

IMO-YMMV

#6Brent Wood
b.wood@niwa.co.nz
In reply to: Dann Corbit (#5)
Re: 10 TB database

Hi Artur,

Some general comments:

I'd look at partitioning and tablespaces to better manage the files where the data is stored, and also at some efficiently parallelised disks behind the filesystems. You might also look at optimising the filesystem & OS parameters to increase efficiency; it is a mix of hardware/OS/filesystem & db setup that needs optimising for such a situation.

For data retrieval, clustered indexes may help, but as this requires a physical reordering of the data on disk, it may be impractical.
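
For reference, the physical reordering described here is the CLUSTER command (8.3 syntax; index and table names hypothetical):

```sql
CREATE INDEX ticks_2009_06_ts_idx ON ticks_2009_06 (ts);

-- Rewrites the whole table in index order and holds an exclusive
-- lock while doing so, which is why it may be impractical at this scale:
CLUSTER ticks_2009_06 USING ticks_2009_06_ts_idx;
```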

Cheers,

Brent Wood

Brent Wood
DBA/GIS consultant
NIWA, Wellington
New Zealand


NIWA is the trading name of the National Institute of Water & Atmospheric Research Ltd.

#7Michelle Konzack
linux4michelle@tamay-dogan.net
In reply to: Artur (#1)
Re: 10 TB database

Hi Artur,

I am the owner of a database about war, war crimes and terrorism with more
than 1.6 TByte, and I am already fscked...

On 2009-06-15 14:00:05, Artur wrote:

Hi!

We are thinking of creating a stock-related search engine.
It is an experimental project, just for fun.

The problem is that we expect to have more than 250 GB of data every month.

I have only 500 MByte per month...

This data would be in two tables, with about 50,000,000 new rows every month.

around 123,000 new rows per month

We want to have access to all the data, mostly for generating user-requested
reports (aggregation).
We would have about 10 TB of data in three years.

Do you think it is possible to build this with PostgreSQL, and do you have any
idea how to start? :)

You have to use a physical cluster like me. Searches in a database of
more than 1 TByte, even when using tablespaces and table partitioning,
run into performance issues...

I have now split my database into chunks of 250 GByte using a cluster of
1U servers from Sun Microsystems. Currently I run 8 servers with one
proxy. Each server cost me 2,300 Euro.

Note: On Friday I have a meeting with a Sun partner in
Germany because of a bigger project, where I have
to increase the performance of my database servers.
I have to plan for 150,000 customers.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
<http://www.tamay-dogan.net/> Michelle Konzack
<http://www.can4linux.org/> c/o Vertriebsp. KabelBW
<http://www.flexray4linux.org/> Blumenstrasse 2
Jabber linux4michelle@jabber.ccc.de 77694 Kehl/Germany
IRC #Debian (irc.icq.com) Tel. DE: +49 177 9351947
ICQ #328449886 Tel. FR: +33 6 61925193

#8Pedro Doria Meunier
pdoria@netmadeira.com
In reply to: Michelle Konzack (#7)
Re: 10 TB database


Hello Arthur,

We have a database with a table growing by ~1.5M rows each month.
The overall growth for the db is ~1 GB/month.
PostgreSQL 8.2.9 on x86_64 - a very modest Dell R200 with 4 GB of RAM.

Although we planned poorly in the beginning (i.e. no clustering,
no partitioning...) - we weren't expecting the boom :] - that
particular server runs like clockwork with hundreds of queries per
minute, and still does so without any noticeable speed loss.

We're, of course, planning for load balancing in the beginning of next
year ... :)

BR,

Pedro Doria Meunier
GSM: +351 96 17 20 188
Skype: pdoriam



#9Todd Lieberman
tlieberman@marchex.com
In reply to: Artur (#1)
Re: 10 TB database

The problem is that we expect to have more than 250 GB of data every month.

Sounds like Teradata or Netezza territory.

#10Martin Gainty
mgainty@hotmail.com
In reply to: Todd Lieberman (#9)
Re: 10 TB database

I would suggest Oracle 11g for a DB of 10 TB or greater:
http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

Martin Gainty


#11Greg Smith
gsmith@gregsmith.com
In reply to: Artur (#1)
Re: 10 TB database

On Mon, 15 Jun 2009, Artur wrote:

We are thinking of creating a stock-related search engine.
It is an experimental project, just for fun.

I hope your idea of fun involves spending a bunch of money on hardware and
endless tweaking to get data loading every day with appropriate
corrections, because that's just the first round of "fun" on a job like
this.

The problem is that we expect to have more than 250 GB of data every
month. This data would be in two tables, with about 50,000,000 new rows
every month. We want to have access to all the data, mostly for
generating user-requested reports (aggregation). We would have about
10 TB of data in three years. Do you think it is possible to build this
with PostgreSQL, and do you have any idea how to start? :)

You start by figuring out what sort of business model is going to justify
this very expensive adventure in today's market where buyers of financial
market products are pretty rare, but that's probably not the question you
wanted an answer to.

You can certainly build a server capable of handling this job with
PostgreSQL here in 2009. Get 8 cores, a stack of 24 1TB disks and a RAID
card with a write cache, and you'll have a big enough system to handle the
job. Basic database design isn't too terribly difficult either. Stock
data is trivial to partition up into tiny pieces at the database level
(each day can be its own 250GB partition), and any system capable of
holding that much data is going to have a giant stack of drives spreading
out the disk I/O too.
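
In the 8.3-era inheritance scheme from the partitioning documentation linked earlier, daily partitions would look something like this (table and column names hypothetical):

```sql
CREATE TABLE tickdata (
    symbol text          NOT NULL,
    ts     timestamp     NOT NULL,
    price  numeric(12,4) NOT NULL
);

-- One child table per day; the CHECK constraint lets the planner
-- skip partitions that cannot match a query's ts predicate:
CREATE TABLE tickdata_20090615 (
    CHECK (ts >= '2009-06-15' AND ts < '2009-06-16')
) INHERITS (tickdata);

-- Loads go directly into the child table; queries go through the
-- parent with constraint exclusion enabled:
SET constraint_exclusion = on;
SELECT symbol, avg(price)
  FROM tickdata
 WHERE ts >= '2009-06-15' AND ts < '2009-06-16'
 GROUP BY symbol;
```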

The first level of problems you'll run into are how to keep up with
loading data every day. The main way to get bulk data in PostgreSQL,
COPY, isn't particularly fast, and you'll be hard pressed to keep up with
250GB/day unless you write a custom data loader that keeps multiple cores
going with that load. Commercial databases have some better solutions to
solve this problem in the base product, or easily available from third
party sources.
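
The COPY path being discussed amounts to one statement like this per file (paths and table names hypothetical); a custom loader would run several of them on separate connections at once to keep multiple cores busy:

```sql
COPY tickdata_20090615 FROM '/load/ticks_20090615.part1.csv' WITH CSV;
-- or streamed from the client side with psql, which works over the wire:
-- \copy tickdata_20090615 FROM 'ticks_20090615.part2.csv' WITH CSV
```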

The much, much bigger problem here is how exactly you're going to provide
a user interface to this data. You can't just give people access to the
whole thing and let them run queries; the first person who executes
something like "select symbol,avg(price) from tickdata group by symbol"
because they want to see the average price of some stock over its lifetime
is going to kill the whole server. You really need to generate the
aggregated reports ahead of time, using an approach like materialized
views, and then only let people grab those. It's possible to manually
create materialized views in PostgreSQL, but that will be yet another bit
of custom development here.
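
A manually maintained materialized view here just means a summary table that the load job refreshes after each day's import, so user queries never touch the raw tick tables (names hypothetical):

```sql
CREATE TABLE daily_avg_price (
    symbol    text          NOT NULL,
    day       date          NOT NULL,
    avg_price numeric(12,4) NOT NULL,
    PRIMARY KEY (symbol, day)
);

-- Refresh step after loading the 2009-06-15 partition:
INSERT INTO daily_avg_price (symbol, day, avg_price)
SELECT symbol, ts::date, avg(price)
  FROM tickdata_20090615
 GROUP BY symbol, ts::date;
```

Reports then aggregate over daily_avg_price, which stays tiny compared with the tick tables.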

The third level of issue is how you scale the app up if you're actually
successful. It's hard enough to get 250GB of daily data loaded into a
single database and storing 10TB of data somewhere; doing the job across a
replicated set of servers, so you can spread the queries out, is even more
"fun" than that.

P.S. If you're not already familiar with how to aggregate properly over a
trading calendar that includes holidays and spots where the market is only
open part of the day, give up now; that's the hardest issue specific to
this particular type of application to get right, and a lot of people
don't realize that early enough in the design process to properly plan for
it.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#12Greg Smith
gsmith@gregsmith.com
In reply to: Whit Armstrong (#3)
Re: 10 TB database

On Mon, 15 Jun 2009, Whit Armstrong wrote:

Is anyone aware of additional examples or tutorials on partitioning?

http://www.pgcon.org/2007/schedule/events/41.en.html
http://blog.mozilla.com/webdev/2007/05/15/partitioning-fun-in-postgresql/
http://benjamin.smedbergs.us/blog/2007-05-12/when-partitioning-database-tables-explain-your-queries/

In that order really; those go from general commentary down to focusing on
specific issues people tend to run into.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#13Greg Smith
gsmith@gregsmith.com
In reply to: Brent Wood (#6)
Re: 10 TB database

On Tue, 16 Jun 2009, Brent Wood wrote:

For data retrieval, clustered indexes may help, but as this requires a
physical reordering of the data on disk, it may be impractical.

This tends to be irrelevant for this sort of data because it's normally
inserted in a fairly clustered way in the first place. The usual way
tables get unclustered involves random insertion and deletion, and that
just doesn't happen for data that's being imported daily and never deleted
afterwards; it's naturally clustered quite well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#14Michelle Konzack
linux4michelle@tamay-dogan.net
In reply to: Greg Smith (#11)
Re: 10 TB database

Hi Greg,

On 2009-06-16 12:13:20, Greg Smith wrote:

The first level of problems you'll run into are how to keep up with
loading data every day. The main way to get bulk data in PostgreSQL,
COPY, isn't particularly fast, and you'll be hard pressed to keep up with
250GB/day unless you write a custom data loader that keeps multiple cores

AFAIK he was talking about 250 GByte/month, which is around 8 GByte a
day, or 300 MByte per hour.

Thanks, Greetings and nice Day/Evening
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
<http://www.tamay-dogan.net/> Michelle Konzack
<http://www.can4linux.org/> c/o Vertriebsp. KabelBW
<http://www.flexray4linux.org/> Blumenstrasse 2
Jabber linux4michelle@jabber.ccc.de 77694 Kehl/Germany
IRC #Debian (irc.icq.com) Tel. DE: +49 177 9351947
ICQ #328449886 Tel. FR: +33 6 61925193

#15Greg Smith
gsmith@gregsmith.com
In reply to: Michelle Konzack (#14)
Re: 10 TB database

On Tue, 16 Jun 2009, Michelle Konzack wrote:

On 2009-06-16 12:13:20, Greg Smith wrote:

you'll be hard pressed to keep up with 250GB/day unless you write a
custom data loader that keeps multiple cores

AFAIK he was talking about 250 GByte/month which are around 8 GByte a
day or 300 MByte per hour

Right, that was just a typo in my response, the comments reflected what he
meant. Note that your averages here presume you can spread that out over
a full 24 hour period--which you often can't, as this type of data tends
to come in a big clump after market close and needs to be loaded ASAP for
it to be useful.

It's harder than most people would guess to sustain that sort of rate
against real-world data (which even fails to import some days) in
PostgreSQL without running into a bottleneck in COPY, WAL traffic, or
database disk I/O (particularly if there's any random access stuff going
on concurrently with the load). Just because your RAID array can write at
hundreds of MB/s does not mean you'll be able to sustain anywhere close to
that during your loading.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#16Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Martin Gainty (#10)
Re: 10 TB database

2009/6/16 Martin Gainty <mgainty@hotmail.com>:

I would suggest Oracle 11g for a DB of 10 TB or greater:
http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

You are joking, right?
Better invest that money in paying someone from -hackers to add any
required features, if there are any!
Or buy a heftier RAID with more disks...

--
GJ

#17Scott Marlowe
scott.marlowe@gmail.com
In reply to: Grzegorz Jaśkiewicz (#16)
Re: 10 TB database

2009/6/16 Grzegorz Jaśkiewicz <gryzman@gmail.com>:

2009/6/16 Martin Gainty <mgainty@hotmail.com>:

I would suggest Oracle 11g for a DB of 10 TB or greater:
http://www.oracle.com/solutions/performance_scalability/tpc-h-10tb-11g.html

You are joking, right ?
Better invest that money in paying someone from -hackers to add
features required, if there will be any !
Or buy for that heftier RAID, with more disks...

You can throw a metric ton of hardware and development at a problem
for the cost of an Oracle license.