huge price database question..

Started by Jim Green · about 14 years ago · 33 messages · general
#1 Jim Green
student.northwestern@gmail.com

Greetings list!
I am pretty new to PostgreSQL, coming from MySQL. I did a fairly extensive
search of the list archives and found a few good threads, but not the exact
situation I have now, so I am venturing to ask here.

I have daily minute-level stock price data from 2005 on, with columns
timestamp, open, high, low, close, volume and a few more. Each day's data
is about 1.2 million rows. I want to import all the data into PostgreSQL
and analyze it using R with the help of RPostgreSQL.

Right now I have about 7000 tables, one per stock, and I use Perl to do the
inserts; it is very slow. I would like to use COPY or another bulk-loading
tool to load the daily raw gzipped data, but I would first need to split
each file into per-stock files before bulk loading, which I consider a bit
messy.

I would like advice on the following idea: store everything in one big
table, partitioned by month (this gives a reasonable number of partitions),
and bulk-load the daily file into it. My queries would consist mostly of
selects for a particular symbol on a particular day.

Also, going forward, I will import new daily data into the database every day.

My hardware is 16 GB RAM and 4x 5400 rpm disks in RAID 10, with enough space.

Thanks!

Jim.

#2 Michael Nolan
htfoot@gmail.com
In reply to: Jim Green (#1)
Re: huge price database question..

right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow. I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.

Are you committing each insert separately or doing them in batches using
'begin transaction' and 'commit'?

I have a database that I do inserts into from a text file. Doing a commit
every 1000 inserts cut the time by over 90%.
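A minimal sketch of that batching pattern in Perl DBI; the table, column
names, and `read_next_row()` parser are hypothetical stand-ins:

```perl
use strict;
use warnings;
use DBI;

# AutoCommit off, so executed inserts accumulate into one transaction.
my $dbh = DBI->connect('dbi:Pg:dbname=prices', '', '',
                       { AutoCommit => 0, RaiseError => 1 });

my $sth = $dbh->prepare(
    'INSERT INTO ticks (symbol, ts, open, high, low, close, volume)
     VALUES (?, ?, ?, ?, ?, ?, ?)');

my $n = 0;
while (my $row = read_next_row()) {   # read_next_row(): your parser (hypothetical)
    $sth->execute(@$row);
    $dbh->commit if ++$n % 1000 == 0; # one commit per 1000 inserts
}
$dbh->commit;                         # flush the final partial batch
$dbh->disconnect;
```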
--
Mike Nolan

#3 Steve Crawford
scrawford@pinpointresearch.com
In reply to: Jim Green (#1)
Re: huge price database question..

On 03/20/2012 04:27 PM, Jim Green wrote:

Greetings list!
I am pretty new to postgresql from mysql

Welcome.

I have daily minute stock price data from 2005 on and each day with
columns timestamp, open,high,low,close,volume and a few more. each
day's data is about 1.2million rows. I want import all the data to
postgresql and analyze using R with the help of Rpostgresql.

right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow.

I'm not sure I understand - please expand and clarify.

If you have 7000 stocks, 1.2 million rows/day is only 171
observations/stock/day or a little under 3-hours of data at 60
samples/hour. Are there holes in the data or am I completely missing the
setup?

For the NYSE you have 52-weeks/year, 5 trading-days/week (except for up
to 8 trading holidays), 6.5 trading-hours/day giving 100,000
rows/stock/year give or take depending on holiday schedule. Over your
time range, that will be less than 1-million rows per stock.

Without holes in the data, it seems you will be adding 2.7 million rows
per trading day.

Is perl doing individual record inserts? Short of reconnecting for each
insert, this is the slowest option. Each insert is its own transaction
and, unless you are using a raid card with writeback enabled (and
hopefully with battery-backed cache if it is) you will hit some
limitations imposed by your disk's rotational speed. If you can ensure
the data is sorted by stock, you can start a transaction, write all the
records for that stock, then commit the transaction which should speed
the operation.

Since you are bulk-adding historical data and can rebuild your database if
it is destroyed, you could consider turning off fsync while you import the
backlog. Dropping indexes while you play catch-up can help as well.

I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.

I would seek advise on the following idea:
store everything in a big table, partition by month(this gives a
reasonable number of partitions) and do bulk loading on the daily
file. my queries would consist mostly select on a particular symbol on
a particular day.

If you already have daily files, you might want to work out the process
of importing a day and apply that to your backlog. There are lots of
options - you might try experimenting.

Given the type of queries you plan, simple partitioning by time period
as you described is a reasonable approach.

You could import a day's data with a bulk COPY, then select from that table
into the various individual stock tables. BTW, you have enough RAM that the
daily bulk file will stay cached, so queries against each individual stock
will probably be quite fast. But you will need to wrap things in a
transaction, or otherwise ensure that you can recover if things fail
part-way through that distribution of data.

You could partition your data by stock symbol and use a trigger on the
parent to put the data into the correct table on insert. I am unsure how
fast this will run - try it and test. One benefit is that your daily
import will fully succeed or fully fail. But you will need to update
your partitioning and triggers to deal with new stocks. You can, of
course, have the trigger build the child-table name from the symbol;
this eliminates the need to alter the trigger code, but it is still
recommended to create the child tables in advance.
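For reference, on the 9.1-era inheritance scheme that trigger approach
might look roughly like this; all table and function names are made up
for illustration:

```perl
# Parent table plus one child per symbol; a trigger on the parent routes
# each inserted row to the matching child (which must already exist).
$dbh->do(<<'SQL');
CREATE TABLE ticks (
    symbol text NOT NULL,
    ts     timestamp NOT NULL,
    open   numeric, high numeric, low numeric, close numeric,
    volume bigint
);
CREATE TABLE ticks_ibm (CHECK (symbol = 'IBM')) INHERITS (ticks);

CREATE OR REPLACE FUNCTION ticks_route() RETURNS trigger AS $$
BEGIN
    -- Build the child-table name from the symbol at run time.
    EXECUTE 'INSERT INTO ticks_' || lower(NEW.symbol)
         || ' SELECT ($1).*' USING NEW;
    RETURN NULL;  -- the row has already been stored in the child
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ticks_route_trg BEFORE INSERT ON ticks
    FOR EACH ROW EXECUTE PROCEDURE ticks_route();
SQL
```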

my hardware is 16G Ram, 4x5400rpm raid10 with enough space.

Have you started with basic tuning? Whatever stock PostgreSQL configuration
you have installed is unlikely to be optimal. (What PG version and OS are
you using?)

Cheers,
Steve

#4 Jim Green
student.northwestern@gmail.com
In reply to: Michael Nolan (#2)
Re: huge price database question..

On 20 March 2012 19:45, Michael Nolan <htfoot@gmail.com> wrote:

right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow. I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.

Are you committing each insert separately or doing them in batches using
'begin transaction' and 'commit'?

I have a database that I do inserts in from a text file. Doing a commit
every 1000 transactions cut the time by over 90%.

I use Perl DBI with a prepared statement. I have also set:
shared_buffers = 4GB
work_mem = 1GB
synchronous_commit = off
effective_cache_size = 8GB
fsync=off
full_page_writes = off

when I do the insert.

Thanks!


#5 Jim Green
student.northwestern@gmail.com
In reply to: Steve Crawford (#3)
Re: huge price database question..

On 20 March 2012 20:19, Steve Crawford <scrawford@pinpointresearch.com> wrote:

On 03/20/2012 04:27 PM, Jim Green wrote:

Greetings list!
I am pretty new to postgresql from mysql

Welcome.

I have daily minute stock price data from 2005 on and each day with
columns timestamp, open,high,low,close,volume and a few more. each
day's data is about 1.2million rows. I want import all the data to
postgresql and analyze using R with the help of Rpostgresql.

right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow.

I'm not sure I understand - please expand and clarify.

If you have 7000 stocks, 1.2 million rows/day is only 171
observations/stock/day or a little under 3-hours of data at 60 samples/hour.
Are there holes in the data or am I completely missing the setup?

Hi:
Stocks get delisted and added, so I don't have 7000 stocks every day; 7000
is just a rough number. Also, many of them are illiquid, so holes are
possible.

For the NYSE you have 52-weeks/year, 5 trading-days/week (except for up to 8
trading holidays), 6.5 trading-hours/day giving 100,000 rows/stock/year give
or take depending on holiday schedule. Over your time range, that will be
less than 1-million rows per stock.

Without holes in the data, it seems you will be adding 2.7 million rows per
trading day.

I spot-checked the data for 03/02: it has 1.2 million rows, and other dates
would be similar.

Is perl doing individual record inserts? Short of reconnecting for each
insert, this is the slowest option. Each insert is its own transaction and,
unless you are using a raid card with writeback enabled (and hopefully with
battery-backed cache if it is) you will hit some limitations imposed by your
disk's rotational speed. If you can ensure the data is sorted by stock, you
can start a transaction, write all the records for that stock, then commit
the transaction which should speed the operation.

I use Perl DBI and a prepared statement to insert into a table per symbol.

Only because you are bulk adding historical data and can rebuild your
database if it is destroyed, you could consider turning off fsync while you
are importing historical data. Dropping indexes while you play catch-up can
help as well.

I have already turned fsync off, but autovacuum uses a lot of I/O and keeps
my 8-core CPU really busy.

 I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.

I would seek advise on the following idea:
store everything in a big table, partition by month(this gives a
reasonable number of partitions) and do bulk loading on the daily
file. my queries would consist mostly select on a particular symbol on
a particular day.

If you already have daily files, you might want to work out the process of
importing a day and apply that to your backlog. There are lots of options -
you might try experimenting.

Given the type of queries you plan, simple partitioning by time period as
you described is a reasonable approach.

You could import a day's data as a bulk copy then select from that table
into the various individual stock tables. BTW, you have plenty of RAM that
that daily bulk file will be cached and queries on each individual stock
will probably be quite fast. But you will need to wrap things in a
transaction or otherwise ensure that you can recover if things fail part-way
through that distribution of data.

You could partition your data by stock symbol and use a trigger on the
parent to put the data into the correct table on insert. I am unsure how

I've read that thousands of partitions might be too many..

fast this will run - try it and test. One benefit is that your daily import
will fully succeed or fully fail. But you will need to update your
partitioning and triggers to deal with new stocks. You can, of course, have
the trigger choose which table to use based on the table name. This
eliminates the need to alter the trigger code but it is still recommended to
make the child-table in advance.

my hardware is 16G Ram, 4x5400rpm raid10 with enough space.

Have you started with basic tuning? Whatever stock PostgreSQL configuration
you have installed is unlikely to be optimal. (What PG version and OS are you
using?)

Yes I have; I posted the settings in my last reply. I run Debian Squeeze
64-bit and PG 9.1.3.

The alternatives look kind of complex to me. Right now my approach (Perl
DBI with prepared inserts) takes about 8-9 minutes to insert a day's data.
I think I'll probably just stick with it and wait.

The autovacuum processes do a lot of I/O and make my PC unusable while I
insert the data. I tested with autovacuum off without much success, because
the workers are still launched due to the transaction-ID wraparound issue.

Thanks!

Jim.


#6 David Kerr
dmk@mr-paradox.net
In reply to: Jim Green (#1)
Re: huge price database question..

On 03/20/2012 04:27 PM, Jim Green wrote:

Greetings list!
I am pretty new to postgresql from mysql and did a fairly extensive
search of the list and came up with a few good ones but didn't find
the exact same situation as I have now. so I am venturing asking here.

I have daily minute stock price data from 2005 on and each day with
columns timestamp, open,high,low,close,volume and a few more. each
day's data is about 1.2million rows. I want import all the data to
postgresql and analyze using R with the help of Rpostgresql.

right now I am having about 7000 tables for individual stock and I use
perl to do inserts, it's very slow. I would like to use copy or other
bulk loading tool to load the daily raw gz data. but I need the split
the file to per stock files first before I do bulk loading. I consider
this a bit messy.

I would seek advise on the following idea:
store everything in a big table, partition by month(this gives a
reasonable number of partitions) and do bulk loading on the daily
file. my queries would consist mostly select on a particular symbol on
a particular day.

Also in the future, I will import daily data to the db every day.

my hardware is 16G Ram, 4x5400rpm raid10 with enough space.

Thanks!

Jim.

Seems like you'd want to do this?
http://search.cpan.org/~turnstep/DBD-Pg-2.19.2/Pg.pm#pg_putcopydata
COPY support

DBD::Pg allows for quick (bulk) reading and storing of data by using the
COPY command. The basic process is to use $dbh->do to issue a COPY
command, and then to either add rows using "pg_putcopydata", or to read
them by using "pg_getcopydata".
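A bare-bones sketch of that flow, assuming an established DBD::Pg handle
with AutoCommit off and a hypothetical `ticks` table whose columns match
the tab-separated fields:

```perl
$dbh->do('COPY ticks FROM STDIN');
$dbh->pg_putcopydata("IBM\t2012-03-20 09:30:00\t205.10\t205.40\t204.90\t205.20\t1200\n");
$dbh->pg_putcopydata("IBM\t2012-03-20 09:31:00\t205.20\t205.60\t205.00\t205.50\t900\n");
$dbh->pg_endcopy;
$dbh->commit;   # needed when AutoCommit is off
```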

#7 Jim Green
student.northwestern@gmail.com
In reply to: David Kerr (#6)
Re: huge price database question..

On 20 March 2012 21:40, David Kerr <dmk@mr-paradox.net> wrote:


Seems like you'd want to do this?
http://search.cpan.org/~turnstep/DBD-Pg-2.19.2/Pg.pm#pg_putcopydata
COPY support

DBD::Pg allows for quick (bulk) reading and storing of data by using the
COPY command. The basic process is to use $dbh->do to issue a COPY command,
and then to either add rows using "pg_putcopydata", or to read them by using
"pg_getcopydata".

Thanks! would you comment on the table setup as well?

Jim.

#8 Brent Wood
Brent.Wood@niwa.co.nz
In reply to: Jim Green (#7)
Re: huge price database question..

Also look at a clustered index on timestamp
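Note that PostgreSQL's CLUSTER is a one-time physical reorder of the table
along an index, not a continuously maintained clustered index; a sketch
with hypothetical table and index names, via an existing DBI handle:

```perl
$dbh->do('CREATE INDEX ticks_ts_idx ON ticks (ts)');
# Rewrites the table in index order; takes an exclusive lock and is not
# maintained by later inserts, so re-run it periodically if order matters.
$dbh->do('CLUSTER ticks USING ticks_ts_idx');
$dbh->do('ANALYZE ticks');
```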

Brent Wood

GIS/DBA consultant
NIWA
+64 (4) 4 386-0300
________________________________________
From: pgsql-general-owner@postgresql.org [pgsql-general-owner@postgresql.org] on behalf of Jim Green [student.northwestern@gmail.com]
Sent: Wednesday, March 21, 2012 2:50 PM
To: David Kerr
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] huge price database question..


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9 Jim Green
student.northwestern@gmail.com
In reply to: Brent Wood (#8)
Re: huge price database question..

On 20 March 2012 21:54, Brent Wood <Brent.Wood@niwa.co.nz> wrote:

Also look at a clustered index on timestamp

Thanks, this looks very helpful. What do you think about thousands of
tables vs. one table partitioned by month? I guess if I go with one
table, the index would be too big to fit in RAM?

Jim.

#10 David Kerr
dmk@mr-paradox.net
In reply to: Jim Green (#7)
Re: huge price database question..

On 03/20/2012 06:50 PM, Jim Green wrote:


Thanks! would you comment on the table setup as well?

Jim.

\copy on 1.2 million rows should only take a minute or two; you could make
that table "unlogged" as well to speed it up more. If you truncate / drop /
create / load / then index the table each time, you'll get the best
throughput.
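One way that daily cycle could look, assuming a parent `ticks` layout and
DBD::Pg for the load step; all names here are illustrative only:

```perl
# Recreate the staging table with no indexes, bulk-load into it,
# index it afterwards, then distribute rows into the real tables.
$dbh->do('DROP TABLE IF EXISTS staging');
$dbh->do('CREATE UNLOGGED TABLE staging (LIKE ticks)');  # unlogged: no WAL

# ... COPY / pg_putcopydata the day's file into staging here ...

$dbh->do('CREATE INDEX staging_symbol_idx ON staging (symbol)');
$dbh->do(q{INSERT INTO ticks_ibm SELECT * FROM staging WHERE symbol = 'IBM'});
$dbh->do('DROP TABLE staging');
$dbh->commit;
```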

Dave

#11 Jim Green
student.northwestern@gmail.com
In reply to: David Kerr (#10)
Re: huge price database question..

On 20 March 2012 22:03, David Kerr <dmk@mr-paradox.net> wrote:

\copy on 1.2million rows should only take a minute or two, you could make
that table "unlogged"
as well to speed it up more.  If you could truncate / drop / create / load /
then index the table each
time then you'll get the best throughput.

Thanks. Could you expand on the "truncate / drop / create / load /
then index the table each time then you'll get the best throughput"
part, or point me to some docs?

Jim


#12 Jim Green
student.northwestern@gmail.com
In reply to: Jim Green (#11)
Re: huge price database question..

On 20 March 2012 22:08, Jim Green <student.northwestern@gmail.com> wrote:

On 20 March 2012 22:03, David Kerr <dmk@mr-paradox.net> wrote:

\copy on 1.2million rows should only take a minute or two, you could make
that table "unlogged"
as well to speed it up more.  If you could truncate / drop / create / load /
then index the table each
time then you'll get the best throughput.

Thanks, Could you explain on the "runcate / drop / create / load /
then index the table each time then you'll get the best throughput."
part.. or point me to some docs?..

Also, if I use COPY, I would be tempted to go the one-table route;
otherwise I need to parse my raw daily file, split it into individual
symbol files, and COPY into each symbol's table (which sounds
inefficient)..


Jim


#13 David Kerr
dmk@mr-paradox.net
In reply to: Jim Green (#11)
Re: huge price database question..

On 03/20/2012 07:08 PM, Jim Green wrote:

On 20 March 2012 22:03, David Kerr<dmk@mr-paradox.net> wrote:

\copy on 1.2million rows should only take a minute or two, you could make
that table "unlogged"
as well to speed it up more. If you could truncate / drop / create / load /
then index the table each
time then you'll get the best throughput.

Thanks. Could you expand on the "truncate / drop / create / load /
then index the table each time then you'll get the best throughput"
part, or point me to some docs?

Jim

I'm imagining that you're loading the raw file into a temporary table
that you then use to process / slice the new data into your 7000+
per-stock tables.

That staging table probably doesn't need to stick around once you've
processed the stocks through it, so you can just truncate or drop it
when you're done.

If you create it without indexes, the inserts will be faster (PostgreSQL
doesn't have to update an index on every insert); once the table is
loaded, you create the indexes (so they're actually useful) and then
process the data into the various stock tables.

Dave

#14 John R Pierce
pierce@hogranch.com
In reply to: Jim Green (#12)
Re: huge price database question..

On 03/20/12 7:12 PM, Jim Green wrote:

Also if I use copy, I would be tempted to go the one table route, or
else I need to parse my raw daily file, separate to individual symbol
file and copy to individual table for each symbol(this sounds like not
very efficient)..

your 7000 tables all contain the exact same information, with the only
difference being the stock ticker symbol, right? then really, the
single table, perhaps partitioned by month or whatever, is the right way
to go. Any schema that makes you have to build SQL statements from
strings for each query is designed wrong.
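On PostgreSQL 9.1, that single partitioned table would be built with
inheritance plus CHECK constraints (declarative partitioning arrived much
later); a sketch with hypothetical names, assuming a `ticks` parent:

```perl
# One child per month; the CHECK constraint lets constraint_exclusion
# skip every other month's child when a query filters on ts.
$dbh->do(<<'SQL');
CREATE TABLE ticks_2012_03 (
    CHECK (ts >= '2012-03-01' AND ts < '2012-04-01')
) INHERITS (ticks);
CREATE INDEX ticks_2012_03_sym_ts ON ticks_2012_03 (symbol, ts);
SQL
# The daily file can then be bulk-loaded straight into the current
# month's child with COPY, \copy, or pg_putcopydata.
```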

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#15 Andy Colson
andy@squeakycode.net
In reply to: Jim Green (#12)
Re: huge price database question..

On 03/20/2012 09:12 PM, Jim Green wrote:

On 20 March 2012 22:08, Jim Green<student.northwestern@gmail.com> wrote:

On 20 March 2012 22:03, David Kerr<dmk@mr-paradox.net> wrote:

\copy on 1.2million rows should only take a minute or two, you could make
that table "unlogged"
as well to speed it up more. If you could truncate / drop / create / load /
then index the table each
time then you'll get the best throughput.

Thanks, Could you explain on the "runcate / drop / create / load /
then index the table each time then you'll get the best throughput."
part.. or point me to some docs?..

Also if I use copy, I would be tempted to go the one table route, or
else I need to parse my raw daily file, separate to individual symbol
file and copy to individual table for each symbol(this sounds like not
very efficient)..

Jim


I think the decisions:

1) one big table
2) one big partitioned table
3) many little tables

would probably depend on how you want to read the data. Writing would be very similar.

I tried to read through the thread but didn't see how you're going to read.

I have Apache logs in a database: a single table, about 18 million rows, with an index on hittime (a timestamp). I can pull a few hundred records based on a time very fast; on the other hand, a count(*) on the entire table takes a while. If you are going to hit lots and lots of records, I think the multi-table approach (which includes partitioning) would be faster. If you can pull records out via an index and be very selective, then one big table works fine.
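Given the stated read pattern (one symbol over a narrow time window), the
analogue in the single-table design would be a composite index; a sketch
with hypothetical names, via an existing DBI handle:

```perl
$dbh->do('CREATE INDEX ticks_sym_ts_idx ON ticks (symbol, ts)');

# A selective query the composite index can answer directly:
my $rows = $dbh->selectall_arrayref(
    q{SELECT ts, open, high, low, close, volume
        FROM ticks
       WHERE symbol = ? AND ts >= ? AND ts < ?},
    undef, 'IBM', '2012-03-02 09:30', '2012-03-02 16:00');
```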

On the Perl side, use COPY. I have code in Perl that uses it (and reads from .gz as well), and it's very fast. I can post some if you'd like.

-Andy

#16 Jim Green
student.northwestern@gmail.com
In reply to: David Kerr (#13)
Re: huge price database question..

On 20 March 2012 22:21, David Kerr <dmk@mr-paradox.net> wrote:

I'm imagining that you're loading the raw file into a temporary table that
you're going to use to
process / slice new data data into your 7000+ actual tables per stock.

Thanks! Would slicing the new data into the 7000+ per-stock tables be a
relatively quick operation?


So that table doesn't probably need to be around once you've processed your
stocks through
that table. so you could just truncate/drop it after you're done.

When you create it, if you avoid indexes the inserts will be faster (it
doesn't have to rebuild the index every
insert) so then once the table is loaded, you create the indexes (So it's
actually useful) and then process the
data into the various stock tables.

Dave

#17 Jim Green
student.northwestern@gmail.com
In reply to: John R Pierce (#14)
Re: huge price database question..

On 20 March 2012 22:22, John R Pierce <pierce@hogranch.com> wrote:

your 7000 tables all contain the exact same information, with the only
difference being the stock ticker symbol, right?    then really, the single
table, perhaps partitioned by month or whatever, is the right way to go.
 Any schema that makes you have to build SQL statements from strings for
each query is designed wrong.

Each table contains 1-minute price data for its symbol, so every table has
the same schema (open, high, low, close, volume, etc.) but not the same
data..

Thanks.


#18 Jim Green
student.northwestern@gmail.com
In reply to: Andy Colson (#15)
Re: huge price database question..

On 20 March 2012 22:25, Andy Colson <andy@squeakycode.net> wrote:

I think the decisions:

1) one big table
2) one big partitioned table
3) many little tables

would probably depend on how you want to read the data.  Writing would be
very similar.

I tried to read through the thread but didnt see how you're going to read.

I have apache logs in a database.  Single table, about 18 million rows.  I
have an index on hittime (its a timestamp), and I can pull a few hundred
records based on a time, very fast.  On the other hand, a count(*) on the
entire table takes a while.  If you are going to hit lots and lots of
records, I think the multi-table (which include partitioning) would be
faster.  If you can pull out records based on index, and be very selective,
then one big table works fine.
On the perl side, use copy.  I have code in perl that uses it (and reads
from .gz as well), and its very fast.  I can post some if you'd like.

My queries would mostly be selects for one symbol on one particular day, or
for a few hours within a particular day; occasionally I would select
multiple symbols over some timestamp range. Your code sample would be
appreciated, thanks!

Jim.


#19 Andy Colson
andy@squeakycode.net
In reply to: Jim Green (#18)
Re: huge price database question..

On 03/20/2012 09:35 PM, Jim Green wrote:

On 20 March 2012 22:25, Andy Colson<andy@squeakycode.net> wrote:

I think the decisions:

1) one big table
2) one big partitioned table
3) many little tables

would probably depend on how you want to read the data. Writing would be
very similar.

I tried to read through the thread but didnt see how you're going to read.

I have apache logs in a database. Single table, about 18 million rows. I
have an index on hittime (its a timestamp), and I can pull a few hundred
records based on a time, very fast. On the other hand, a count(*) on the
entire table takes a while. If you are going to hit lots and lots of
records, I think the multi-table (which include partitioning) would be
faster. If you can pull out records based on index, and be very selective,
then one big table works fine.
On the perl side, use copy. I have code in perl that uses it (and reads
from .gz as well), and its very fast. I can post some if you'd like.

my queries would mostly consider select for one symbol for one
particular day or a few hours in a particular day, occasionally I
would do select on multiple symbols for some timestamp range. you code
sample would be appreciated, Thanks!

Jim.

-Andy

Here is some copy/pasted parts:

my @list = glob('*.gz');
for my $fname (@list)
{
    $db->do('COPY access FROM STDIN');
    open my $fh, '-|', '/usr/bin/zcat', $fname or die "$fname: $!";
    while (<$fh>)
    {
        # bunch of per-line formatting snipped here;
        # if you have comma-separated data or similar you might be able
        # to feed it in directly
        $db->pg_putcopydata("$county\t$ip\t$time\t$status\t$size\t$url\t$ua\n");
    }
    $db->pg_endcopy;
    $db->commit;
}

Do you ever plan on batch-deleting a BUNCH of records?

Do you ever want to read all of one symbol (like, select avg(high) from stocks where symbol = 'bob')?
-Andy

#20 Jim Green
student.northwestern@gmail.com
In reply to: Andy Colson (#19)
Re: huge price database question..

On 20 March 2012 22:43, Andy Colson <andy@squeakycode.net> wrote:


Do you ever plan on batch deleted a BUNCH of records?

No; after the historical data is populated, I'll only add data daily. No deletes..

Do you ever want to do read all of one symbol (like, select avg(high) from
stocks where symbol = 'bob')?

Yes, it's possible, but I would more likely pull the data into R and
compute the average there..

Thanks,
Jim.


#21 John R Pierce
pierce@hogranch.com
In reply to: Jim Green (#20)
#22 Andy Colson
andy@squeakycode.net
In reply to: Jim Green (#20)
#23 Andy Colson
andy@squeakycode.net
In reply to: Brent Wood (#8)
#24 David Kerr
dmk@mr-paradox.net
In reply to: Jim Green (#16)
#25 Jim Green
student.northwestern@gmail.com
In reply to: John R Pierce (#21)
#26 Jim Green
student.northwestern@gmail.com
In reply to: Andy Colson (#22)
#27 Steve Crawford
scrawford@pinpointresearch.com
In reply to: Jim Green (#5)
#28 Jim Green
student.northwestern@gmail.com
In reply to: Steve Crawford (#27)
#29 Lee Hachadoorian
Lee.Hachadoorian+L@gmail.com
In reply to: Jim Green (#25)
#30 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Jim Green (#28)
#31 Andy Colson
andy@squeakycode.net
In reply to: Lee Hachadoorian (#29)
#32 Steve Crawford
scrawford@pinpointresearch.com
In reply to: Jim Green (#28)
#33 Lee Hachadoorian
Lee.Hachadoorian+L@gmail.com
In reply to: Andy Colson (#31)