Streaming large data into Postgres [WORM-like applications]
Here is the straight dope: one of the internal teams at my customer site
is looking into MySQL and replacing its storage engine so that they
can store large amounts of streamed data. The key here is that the data
they are getting arrives as several thousand rows in an extremely short
duration. They say that only MySQL gives them the ability to
replace the storage engine, which, granted, is easier.
If I go in with the claim that Postgres can basically do what they
intend for handling large datasets, I need to prepare my talking
points.
The requirements are as follows:
1. Large numbers of streamed rows, on the order of 50-100k rows per
second. I was thinking that the rows could be stored in a file, the
file copied into a temp table using COPY, and those rows then appended
to the master table, with the index dropped and recreated very lazily
[during the first query hit or something like that].
The table can grow extremely large, so it would help if it can be
partitioned, either by range or by list.
2. Most of the streamed rows are very similar. Think syslog rows,
where in most cases only the timestamp changes. If the data can be
compressed, that will result in real savings in disk size.
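The spool-then-COPY flow in point 1 can be sketched roughly as below. This is only an illustration under assumptions: psycopg2 as the driver and an `events_staging` table are both hypothetical, and only the row-formatting step is exercised here; the actual COPY call is left as a comment.

```python
import io

def to_copy_text(rows):
    """Format tuples as the tab-delimited text that COPY ... FROM STDIN expects."""
    buf = io.StringIO()
    for row in rows:
        # NULLs become \N, everything else its plain text form
        buf.write("\t".join("\\N" if v is None else str(v) for v in row) + "\n")
    buf.seek(0)
    return buf

# With a live server, the spooled buffer would be streamed in a single COPY:
#   import psycopg2
#   conn = psycopg2.connect("dbname=streamdb")   # hypothetical DSN
#   with conn, conn.cursor() as cur:
#       cur.copy_expert("COPY events_staging FROM STDIN", to_copy_text(rows))

buf = to_copy_text([(1, "host1", None), (2, "host2", "link up")])
```

COPY bypasses per-row INSERT overhead, which is the whole point of spooling first.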
The key issue here is that the ultimate data usage is Write Once Read
Many, and in that sense I am looking for a solution optimized for
bulk writes and for maintaining indexes during bulk writes.
So with some intelligent design, it should be possible to use Postgres.
Any help in preparing my talking points is appreciated.
Regards
Dhaval
Inserting 50,000 rows a second is, uh... difficult to do, no matter
what database you're using. You'll probably have to spool the inserts
and insert them as fast as you can, and just hope you don't fall too
far behind.
But I'm suspecting that you aren't going to be doing much, if any,
referential integrity checking, at least beyond basic type checking.
You probably aren't going to care about multiple inserts affecting
each other, or worry about corruption if a given insert fails... in
fact, you probably aren't even going to need transactions at all,
other than as a way to insert faster. Is SQL the right tool for you?
On May 11, 2007, at 1:43 PM, Dhaval Shah wrote:
I do care about the following:
1. Basic type checking
2. Knowing about failed inserts
3. Non-corruption
4. Macro transactions, that is, minimal read consistency
The following is not necessary:
1. Referential integrity
In this particular scenario:
1. There are sustained loads and peak loads. As long as we can handle
the peak loads, the sustained load can be half of the quoted figure.
2. The rows have a limited number of columns: at most a dozen or so,
mostly integer or varchar.
It is more data-I/O heavy than CPU heavy.
Regards
Dhaval
One approach would be to spool all the data to a flat file and then
pull them into the database as you are able to. This would give you
extremely high peak capability.
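Ben's spool-and-drain idea might look something like this minimal sketch, assuming a single writer process; the file name is invented, and the actual bulk load (e.g. via COPY) is left to whatever loader consumes the drained batch.

```python
import os
import tempfile

class Spool:
    """Append rows to a flat file; hand back whole batches for bulk loading."""

    def __init__(self, path):
        self.path = path

    def append(self, line):
        with open(self.path, "a") as f:
            f.write(line + "\n")

    def drain(self):
        """Atomically take the spooled file for a bulk load, then start fresh."""
        if not os.path.exists(self.path):
            return []
        batch_path = self.path + ".loading"
        os.rename(self.path, batch_path)   # new appends go to a fresh spool file
        with open(batch_path) as f:
            batch = f.read().splitlines()
        os.remove(batch_path)
        return batch

# Demo: spool a few rows during a "peak", then drain them as one batch.
spool = Spool(os.path.join(tempfile.mkdtemp(), "rows.spool"))
for line in ("a\t1", "b\t2", "c\t3"):
    spool.append(line)
batch = spool.drain()
```

The rename makes draining safe against a concurrent appender: the writer simply starts a new spool file while the old one is loaded.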
On 05/11/07 21:35, Dhaval Shah wrote:
Have you tested PG (and MySQL, for that matter) to determine what
kind of load they can handle on existing h/w?
Back to the original post: 100K inserts/second is 360 *million*
inserts per hour. That's a *lot*. Even if the steady state is 50K
inserts/sec, that's 180M inserts/hr. If each record is 120 bytes,
that's 43 gigabytes per hour, which is 12 MB/second. No problem
from a h/w standpoint.
However, it will fill a 300GB HDD in 7 hours.
--
Ron Johnson, Jr.
Jefferson LA USA
Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
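For the record, Ron's arithmetic checks out; restated as a few lines:

```python
# Throughput math for the quoted peak rate and a 120-byte record.
rows_per_sec = 100_000
inserts_per_hour = rows_per_sec * 3600         # 360 million inserts/hour
steady_per_hour = 50_000 * 3600                # 180 million inserts/hour
bytes_per_sec = rows_per_sec * 120             # 12,000,000 B/s = 12 MB/s
gb_per_hour = bytes_per_sec * 3600 / 1e9       # ~43 GB/hour
hours_to_fill_300gb = 300 / gb_per_hour        # ~7 hours
```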
At 04:43 AM 5/12/2007, Dhaval Shah wrote:
1. Large amount of streamed rows. In the order of @50-100k rows per
second. I was thinking that the rows can be stored into a file and the
file then copied into a temp table using copy and then appending those
rows to the master table. And then dropping and recreating the index
very lazily [during the first query hit or something like that]
Is it one process inserting, or can it be many processes?
Is it just a (relatively) short high burst, or is that rate sustained
for a long time? If it's sustained, I don't see the point of doing so
many copies.
How many bytes per row? If the rate is sustained and the rows are big,
then you are going to need LOTS of disks (e.g. a large RAID10).
When do you need to do the reads, and how up to date do they need to be?
Regards,
Link.
Consolidating my responses in one email.
1. The total data expected is some 1-1.5 TB a day. 75% of the data
comes in a period of 10 hours; the remaining 25% comes in the other 14
hours. Of course there are ways to smooth the load patterns, but the
current scenario is as explained.
2. I do expect the customer to roll in something like a NAS/SAN with
TBs of disk space. The idea is to retain the data for a duration and
then offload it to tape.
That leads to the question: can the data be compressed? Since the data
is very similar, compression should yield some 6x-10x. Is there a way
to identify which partitions are in which data files and keep them
compressed until they are actually read?
Regards
Dhaval
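The 6x-10x estimate is easy to sanity-check with stdlib gzip on made-up syslog-style lines where only the timestamp varies. This says nothing about compressing Postgres data files in place, only about how compressible such data is in principle.

```python
import gzip

# 10,000 near-identical syslog-style lines; only the seconds field changes.
lines = [
    f"2007-05-12T10:00:{i % 60:02d} host42 sshd[1234]: "
    f"Accepted publickey for app from 10.0.0.7\n"
    for i in range(10_000)
]
raw = "".join(lines).encode()
compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)   # far beyond 6x for data this repetitive
```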
At 8:49p on 12 May 2007, Dhaval Shah wrote:
That leads to the question, can the data be compressed? Since the data
is very similar, any compression would result in some 6x-10x
compression. Is there a way to identify which partitions are in which
data files and compress them until they are actually read?
There was a very interesting article in ;login: magazine in April of
this year discussing how they dealt with an exorbitant amount of largely
similar data. The article claimed that through aggregation and gzip
compression, they were able to reduce what they needed to store by
roughly 350x, or about 0.7 bytes per 'event'. The article is:
The Secret Lives of Computers Exposed: Flight Data Recorder for Windows
by Chad Verbowski
You might try to get your mitts on that article for some ideas. I'm not
sure you could apply any of their ideas directly to the Postgres backend
data files, but perhaps somewhere in your pipeline.
Kevin
On 05/12/07 19:49, Dhaval Shah wrote:
45TB per month???? Wow.
The archival process *must* be considered when designing the system.
PostgreSQL's ability to use tablespaces and partitioned tables will
make that much easier. Otherwise, you'd have to be deleting from
one "side" of the table while inserting into the other "side".
Partitioning will also let you divide the table into multiple
"active" segments, so that multiple inserters can run simultaneously
without stepping on each other while spreading the load across
multiple controllers and RAID-sets.
If it's a SAN/NAS that is organized into RAID-5 groups, make *sure*
that it has *lots* of battery-backed write-back cache.
Regarding compression: if the columns are integers or short VARCHAR
fields, I do not see how compression can help you, unless you use
block-layer compression.
Which Linux doesn't do.
Does FreeBSD have block-level compression?
--
Ron Johnson, Jr.
Jefferson LA USA
Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!
Dhaval Shah wrote:
2. Most of the streamed rows are very similar. Think syslog rows,
where for most cases only the timestamp changes. Of course, if the
data can be compressed, it will result in improved savings in terms of
disk size.
If it really is usually just the timestamp that changes, one way to
"compress" such data might be to split your logical row into two
tables. The first table has all the original columns but the timestamp,
plus an ID. The second table has the timestamp and a foreign key into
the first table. Depending on how wide your original row is, and how
often it's only the timestamp that changes, this could result in
decent "compression".
Of course, now you need referential integrity.
- John D. Burger
MITRE
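The split above can be sketched with stdlib sqlite3 standing in for Postgres; the table and column names are invented, and the lookup-or-insert is what makes repeated messages share a single wide row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- wide part of the logical row, stored once per distinct message
    CREATE TABLE msg   (id INTEGER PRIMARY KEY, host TEXT, body TEXT);
    -- narrow part: one row per occurrence, just timestamp + foreign key
    CREATE TABLE event (ts TEXT, msg_id INTEGER REFERENCES msg(id));
""")

def log(ts, host, body):
    """Reuse an existing msg row if this (host, body) was seen before."""
    row = db.execute("SELECT id FROM msg WHERE host=? AND body=?",
                     (host, body)).fetchone()
    msg_id = row[0] if row else db.execute(
        "INSERT INTO msg(host, body) VALUES (?, ?)", (host, body)).lastrowid
    db.execute("INSERT INTO event(ts, msg_id) VALUES (?, ?)", (ts, msg_id))

# 100 events that differ only in timestamp -> one msg row, 100 event rows.
for i in range(100):
    log(f"2007-05-12T10:00:{i % 60:02d}", "host42", "link up")
```

The win grows with row width: the wide columns are stored once instead of 100 times.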
I thought of something similar. Maybe you could put those timestamps in
an array column; saves you a referential integrity check that you don't
seem to need very much.
OTOH, _if_ your log messages(?) look very similar each time, you may be
able to turn the problem around: you store unique log messages, with the
timestamps at which they occurred. That way you rarely need to store more
than a timestamp. It will add time to look up the matching log message
(though there will be far fewer of them). I'm quite confident
you'll save time inserting records this way, although that's hand-waving
at this point.
You may be able to parameterize some log messages and store the
parameters with the timestamps. That takes a bit more processing
(regular expression matches, maybe?), and you'll want to know all the
different log message permutations beforehand.
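A rough sketch of that parameterization, with invented message patterns: a known regex per message shape pulls out the variable parts, so only a template id, the parameters, and the timestamp need storing; unknown shapes fall back to the raw text.

```python
import re

# Hypothetical catalogue of known message shapes.
TEMPLATES = [
    (1, re.compile(r"Accepted publickey for (\S+) from (\S+)")),
    (2, re.compile(r"Connection closed by (\S+)")),
]

def parameterize(ts, message):
    """Return (ts, template_id, params); unknown shapes keep the raw text."""
    for tid, pattern in TEMPLATES:
        m = pattern.search(message)
        if m:
            return (ts, tid, m.groups())
    return (ts, None, (message,))

rec = parameterize("2007-05-12T10:00:01",
                   "sshd[1234]: Accepted publickey for app from 10.0.0.7")
```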
Regards,
--
Alban Hertroys
alban@magproductions.nl
magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
7500 AK Enschede
// Integrate Your World //