Strategies/Best Practises Handling Large Tables

Started by Chitra Creta over 13 years ago · 14 messages · general
#1 Chitra Creta
chitracreta@gmail.com

Hi,

I currently have a table that is growing very quickly - i.e. 7 million
records in 5 days. This table acts as a placeholder for statistics, and
hence the records are merely inserted and never updated or deleted.

Many queries are run on this table to obtain trend analysis. However, these
queries are now starting to take a very long time (hours) to execute due to
the size of the table.

I have put indexes on this table, to no significant benefit. Some of the
other strategies I have thought of:
1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, and yearly summary tables that contain
aggregated data specific to the statistics required

Does anyone know what the best practice is for handling this situation?

I would appreciate knowledge sharing on the pros and cons of the above, or
if there are any other strategies that I could put in place.

Thanking you in advance.

#2 Lonni J Friedman
netllama@gmail.com
In reply to: Chitra Creta (#1)
Re: Strategies/Best Practises Handling Large Tables


Partitioning is probably your best solution. Options 3 and 4 sound like
variations on the same thing. Before you go that route, you should
make sure that your bottleneck really is the massive amount of data,
and not some other problem. Are you sure that the indices you created
are being used, and that you have all the indices you need for your
queries? Look at the query plan output from EXPLAIN, and/or post it
here if you're unsure.
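As a sketch of that check (the table and column names here are made up for illustration; substitute your own schema):

```sql
-- A typical trend query over a hypothetical stats table.
EXPLAIN ANALYZE
SELECT date_trunc('day', created_at) AS day, count(*)
FROM stats
WHERE created_at >= now() - interval '7 days'
GROUP BY 1;

-- In the plan output, look for "Index Scan using ..." versus
-- "Seq Scan on stats". A Seq Scan over the full table suggests the
-- index is missing, unusable for this predicate, or not selective enough.
```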

Reindexing shouldn't make a difference unless something is wrong with
the indices that you already have in place.

Purging old data is only a good solution if you do not need the data,
and never will need the data.

#3 Ondrej Ivanič
ondrej.ivanic@gmail.com
In reply to: Chitra Creta (#1)
Re: Strategies/Best Practises Handling Large Tables

Hi,

On 13 October 2012 01:44, Chitra Creta <chitracreta@gmail.com> wrote:

I currently have a table that is growing very quickly - i.e. 7 million
records in 5 days. This table acts as a placeholder for statistics, and
hence the records are merely inserted and never updated or deleted.

Many queries are run on this table to obtain trend analysis. However, these
queries are now starting to take a very long time (hours) to execute due to
the size of the table.

Have you changed the autovacuum settings to make it more aggressive?
Another option is to run ANALYZE after loading.
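Both suggestions can be sketched as follows (the table name is illustrative; per-table autovacuum storage parameters have been available since PostgreSQL 8.4):

```sql
-- Make autovacuum analyze this table more often than the default
-- (the thresholds here are example values, not recommendations).
ALTER TABLE stats SET (autovacuum_analyze_scale_factor = 0.02,
                       autovacuum_vacuum_scale_factor  = 0.05);

-- Or refresh the planner's statistics manually right after a bulk load:
ANALYZE stats;
```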

I have put indexes on this table, to no significant benefit. Some of the
other strategies I have thought of:
1. Purge old data
3. Partition

Those two go together. Partitioning is useful if you can constrain
queries to specific ranges, i.e. this query needs only the last two days. You
shouldn't go over 200-300 partitions per table. Partition
granularity should match the amount of data an average query touches:
if you run weekly queries then use weekly partitions (bi-weekly or
daily partitions might work, but I have not had good experience with them).

It is easy to purge old data because you only need to drop the unwanted
partitions (no table/index bloat). Loading is a little bit tricky
because you have to load the data into the right partition.
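In the inheritance-based scheme current at the time (before native declarative partitioning arrived in PostgreSQL 10), a monthly partition and a purge look roughly like this; all names are illustrative:

```sql
-- Parent table plus one monthly child partition.
CREATE TABLE stats (created_at timestamptz NOT NULL, value bigint);

CREATE TABLE stats_2012_10 (
    CHECK (created_at >= '2012-10-01' AND created_at < '2012-11-01')
) INHERITS (stats);

-- With constraint_exclusion = partition, a query constrained to October
-- scans only stats_2012_10. Purging a month is then a cheap metadata
-- operation rather than a bloat-producing DELETE:
DROP TABLE stats_2012_09;
```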

4. Creation of daily, monthly, and yearly summary tables that contain
aggregated data specific to the statistics required

I think this is the way to hell. You start with a few tables, and then you
add more tables until you realise that it takes longer to update them
than to run your queries :)

You might benefit from query parallelisation, for example, pgpool-II,
Stado, or Postgres-XC; or do not use Postgres at all. For example, any
column-oriented NoSQL database might be a good choice.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)

#4 Ryan Kelly
rpkelly22@gmail.com
In reply to: Chitra Creta (#1)
Re: Strategies/Best Practises Handling Large Tables

On Sat, Oct 13, 2012 at 01:44:02AM +1100, Chitra Creta wrote:

I have put indexes on this table, to no significant benefit. Some of the
other strategies I have thought of:
1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, and yearly summary tables that contain
aggregated data specific to the statistics required

Does anyone know what the best practice is for handling this situation?

Three and four will probably be your biggest wins. We do both; there are
trade-offs for each. If you need the absolute best response times
(e.g., in a web application), summary tables are the way to go. If
you're regularly querying data in a specific date range, but response
times aren't as critical (e.g., a daily report generated at the end of the
day), partitioning will also help there.

Having the correct indexes is crucial, as always.

Additionally, you will be able to purge old data relatively quickly and
without bloat using the partitioning method.


-Ryan Kelly

#5 John R Pierce
pierce@hogranch.com
In reply to: Chitra Creta (#1)
Re: Strategies/Best Practises Handling Large Tables

On 10/12/12 7:44 AM, Chitra Creta wrote:

1. Purge old data
2. Reindex
3. Partition
4. Creation of daily, monthly, and yearly summary tables that contain
aggregated data specific to the statistics required

If most of your queries read the majority of the table, indexing will
be of little help.

Partitioning will aid in purging old data, as you can partition by date
(for instance, by week) and drop whole partitions rather than deleting
individual records.

Aggregate tables will likely be the biggest win for your statistics if
they reduce the amount of data you need to query.

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#6 Chitra Creta
chitracreta@gmail.com
In reply to: John R Pierce (#5)
Re: Strategies/Best Practises Handling Large Tables

Thank you all for your suggestions. Since all of you recommended the
Partition option, I decided to try it out myself.

I found a function that created partition tables for every month in the
last two years. A trigger was also added to the parent table to ensure that
every insert into it will henceforth go into the
appropriate child table.

However, there were a few observations that I made which I would appreciate
your comments on:

1. Since the existing data was in the parent table, I had to do a pg_dump on
it, drop it, and then do a restore on it to force the trigger to work on
the existing data. Is this how partitioning existing data should be done?

2. I noticed that there are two copies of the same record - i.e. the one
that was inserted into the parent table and another that was inserted into
the child table. If I delete the record in the parent table, the child
record gets automatically deleted. I was under the impression that
partitioning meant that my parent table would not be large anymore because
the data would be moved to smaller child tables. Is this the case?

3. Is there a way for me to evaluate the effectiveness of the partitioned
table? Would performing an EXPLAIN ANALYZE allow me to determine whether
querying the parent table for statistics is quicker than querying against a
massive non-partitioned table?

Thank you.

#7 Ryan Kelly
rpkelly22@gmail.com
In reply to: Chitra Creta (#6)
Re: Strategies/Best Practises Handling Large Tables

On Tue, Oct 16, 2012 at 09:26:09PM +1100, Chitra Creta wrote:

Thank you all for your suggestions. Since all of you recommended the
Partition option, I decided to try it out myself.

I found a function that created partition tables for every month in the
last two years. A trigger was also added to the parent table to ensure that
every insert into it will henceforth go into the
appropriate child table.

However, there were a few observations that I made which I would appreciate
your comments on:

1. Since the existing data was in the parent table, I had to do a pg_dump on
it, drop it, and then do a restore on it to force the trigger to work on
the existing data. Is this how partitioning existing data should be done?

I just wrote a one-time function to move it.
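Ryan's actual function isn't shown in the thread; a minimal one-time move in the same spirit (table and column names are illustrative) could look like this:

```sql
-- Move October 2012 rows from the parent into its child partition.
-- ONLY restricts the SELECT and DELETE to the parent table itself,
-- so the rows already routed to child tables are untouched.
BEGIN;
INSERT INTO stats_2012_10
    SELECT * FROM ONLY stats
    WHERE created_at >= '2012-10-01' AND created_at < '2012-11-01';
DELETE FROM ONLY stats
    WHERE created_at >= '2012-10-01' AND created_at < '2012-11-01';
COMMIT;
```

Repeating this per month (or looping over months in a PL/pgSQL function) avoids the dump-and-restore cycle described above.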

2. I noticed that there are two copies of the same record - i.e. the one
that was inserted into the parent table and another that was inserted into
the child table. If I delete the record in the parent table, the child
record gets automatically deleted. I was under the impression that
partitioning meant that my parent table would not be large anymore because
the data would be moved to smaller child tables. Is this the case?

The data *is* in the child tables. Queries on the parent table, by
default, affect data in the child tables. So, issuing a SELECT against
your parent table will also query the child tables. DELETE will,
similarly, delete data in the child tables. You may target just the
parent table using ONLY, e.g. SELECT * FROM ONLY foo. This behavior is
also controlled by the GUC sql_inheritance, though I encourage you not
to change this value. To get a better idea of what is happening, look at
the output from EXPLAIN to see all the tables that are being included in
your plan.
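A quick way to see this inheritance behavior for yourself (again with a hypothetical parent table named stats):

```sql
SELECT count(*) FROM stats;        -- parent plus all child partitions
SELECT count(*) FROM ONLY stats;   -- rows physically in the parent alone

EXPLAIN SELECT * FROM stats WHERE created_at >= '2012-10-01';
-- The plan lists every child table that inheritance pulls into the scan;
-- with constraint_exclusion enabled, non-matching partitions drop out.
```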

3. Is there a way for me to evaluate the effectiveness of the partitioned
table? Would performing an EXPLAIN ANALYZE allow me to determine whether
querying the parent table for statistics is quicker than querying against a
massive non-partitioned table?

Well, you can do it with EXPLAIN ANALYZE, or you can do it by timing
your query, so that the overhead of EXPLAIN ANALYZE does not come into
play.

Also, I assume you've read this:
http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html

That link will be helpful in understanding how partitioning could
benefit you.

-Ryan Kelly

#8 Merlin Moncure
mmoncure@gmail.com
In reply to: Lonni J Friedman (#2)
Re: Strategies/Best Practises Handling Large Tables

On Fri, Oct 12, 2012 at 10:00 AM, Lonni J Friedman <netllama@gmail.com> wrote:


Partitioning is probably your best solution. Options 3 and 4 sound like
variations on the same thing. Before you go that route, you should
make sure that your bottleneck really is the massive amount of data,
and not some other problem. Are you sure that the indices you created
are being used, and that you have all the indices you need for your
queries? Look at the query plan output from EXPLAIN, and/or post it
here if you're unsure.

Partitioning is not a strategy to improve query performance unless your
queries exploit the partition structure in some way.

merlin

#9 Shaun Thomas
sthomas@optionshouse.com
In reply to: Chitra Creta (#1)
Re: Strategies/Best Practises Handling Large Tables

On 10/12/2012 09:44 AM, Chitra Creta wrote:

4. Creation of daily, monthly, and yearly summary tables that contain
aggregated data specific to the statistics required

If the partitioning doesn't help you, I strongly urge you to build one
or more fact tables with appropriate dimensions. If you don't know what
these terms mean, there are lots of books on the subject. They're very
versatile for producing fast reports on varying inputs. You can also
"layer" them by having cascading levels of aggregation from day -> month
-> year, and so on.
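One such layer can be sketched as follows, assuming a raw stats table with a created_at timestamp and a value column (all names here are illustrative, not from the thread):

```sql
-- Daily rollup; a monthly rollup can aggregate this table in turn,
-- giving the day -> month -> year cascade described above.
CREATE TABLE stats_daily AS
SELECT date_trunc('day', created_at) AS day,
       count(*)   AS n_events,
       sum(value) AS total_value
FROM stats
GROUP BY 1;

-- Trend queries then read the small rollup instead of the raw table:
SELECT day, total_value FROM stats_daily ORDER BY day;
```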

These kinds of reporting structures are perfect for huge data
accumulation warehouses. The book most recommended to me back in the day
was The Data Warehouse Toolkit, and I can also vouch that it covers
these subjects pretty well.

I still recommend partitioning simply due to maintenance overhead, but
you'll want to look into this too.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


#10 Chris Travers
chris.travers@gmail.com
In reply to: Chitra Creta (#1)
Re: Strategies/Best Practises Handling Large Tables


The answer is: well, it depends. Possibly some combination.

One approach I like, which may or may not be included in #4, is the
idea of summary tables that contain snapshots of the data, allowing you to
roll forward or backward from defined points. This is what I call the log,
aggregate, and snapshot approach. But it really depends on what you are
doing, and there is no one-size-fits-all approach at this volume.

Instead of reindexing, I would suggest also looking into partial indexes.
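For example (purely illustrative; a partial index covers only rows matching its predicate, keeping it small enough to stay in cache):

```sql
-- Index only the recent rows that trend queries actually touch.
-- The predicate must use constants (not now()), so such an index
-- is typically dropped and recreated periodically with a newer cutoff.
CREATE INDEX stats_recent_idx ON stats (created_at)
WHERE created_at >= '2012-10-01';
```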

Best Wishes,

#11 Chitra Creta
chitracreta@gmail.com
In reply to: Chris Travers (#10)
Re: Strategies/Best Practises Handling Large Tables

Thank you all.

Ryan, would you mind sharing your one-time function to move it?

Merlin, what are your suggestions to improve query performance?

Shaun, thank you. I will look into facts and dimensions should all else
fail.

Chris, would you mind giving me an example of what you mean by your log,
aggregate and snapshot approach? Also, with indexing, I believe composite
and partial indexes are better than ordinary single-column indexes; am I
correct? Do you have any recommendations as to which type (e.g. btree,
hash) is better for which situations?

Thank you,
Chitra


#12 Chris Travers
chris.travers@gmail.com
In reply to: Chitra Creta (#11)
Re: Strategies/Best Practises Handling Large Tables


Sure. Suppose I have an accounting system.

I may record the amounts of the transactions in journal_entry and
journal_line tables. These will be write once, read many. Over time you
will end up having to digest millions of records (given sufficient volume)
to find out the balance of a checking account, and this is not really ideal.

So to deal with this, I might, for example, add a table called
account_checkpoint which might have the following fields:

account_id
end_date
debits
credits
balance

And then I can snapshot, on closing of the books, the accumulated debits,
credits, and balance to date. If I need any of these numbers I can just
grab the appropriate number from account_checkpoint and roll forward from
end_date. If I have too much volume I can have closings at a monthly level
or whatever.

The checkpoint table contains sufficient information for me to start a
report at any point and end it at any other point without having to scan
the intervening checkpointed periods. Additionally, if I want to purge old
data, I can do so without losing current balance information.

So what this approach does, in essence, is give you a way to purge
without losing the aggregated information, and a way to skip portions of
the table for aggregation purposes that you can't skip otherwise. The big
thing is you cannot insert (and if this is in doubt, you need to enforce
this with a trigger) any records effective before the most recent checkpoint.
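A sketch of that table and a roll-forward query (the column types, and the journal_line columns amount and effective_date, are assumptions; the thread gives only the field names):

```sql
CREATE TABLE account_checkpoint (
    account_id int     NOT NULL,
    end_date   date    NOT NULL,
    debits     numeric NOT NULL,
    credits    numeric NOT NULL,
    balance    numeric NOT NULL,
    PRIMARY KEY (account_id, end_date)
);

-- Current balance = latest checkpoint + journal activity after end_date,
-- so the millions of older journal rows are never scanned.
SELECT c.balance + coalesce(sum(j.amount), 0) AS current_balance
FROM account_checkpoint c
LEFT JOIN journal_line j
       ON j.account_id = c.account_id
      AND j.effective_date > c.end_date
WHERE c.account_id = 42
  AND c.end_date = (SELECT max(end_date) FROM account_checkpoint
                    WHERE account_id = 42)
GROUP BY c.balance;
```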

Best Wishes,
Chris Travers

#13 Chitra Creta
chitracreta@gmail.com
In reply to: Chris Travers (#12)
Re: Strategies/Best Practises Handling Large Tables

Thanks for your example, Chris. I will look into it as a long-term solution.

Partitioning tables as a strategy worked very well indeed. This will be my
short/medium-term solution.

Another strategy that I would like to evaluate as a short/medium-term
solution is archiving old records in a table before purging them.

I am aware that Oracle has a tool that allows records to be exported into a
file / archive table before purging them. They also provide a tool to
import these records.

Does PostgreSQL have similar tools to export to a file and re-import?

If PostgreSQL does not have such a tool, does anyone have any ideas
on what file format (e.g. a text file with a header row of column names
and one row per record) would be ideal for easy re-importing
into a PostgreSQL table?

Thank you for your ideas.


#14 Igor Romanchenko
igor.a.romanchenko@gmail.com
In reply to: Chitra Creta (#13)
Re: Strategies/Best Practises Handling Large Tables


PostgreSQL has COPY TO for exporting records to a file, and COPY FROM for
re-importing them ( http://wiki.postgresql.org/wiki/COPY ).
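The archive-then-purge round trip can be sketched as follows (the file path and table are illustrative):

```sql
-- Export one cutoff's worth of old records, then delete them.
COPY (SELECT * FROM stats WHERE created_at < '2012-01-01')
TO '/var/backups/stats_pre_2012.csv' WITH CSV HEADER;

DELETE FROM stats WHERE created_at < '2012-01-01';

-- Re-import later if needed; CSV HEADER round-trips the column names.
COPY stats FROM '/var/backups/stats_pre_2012.csv' WITH CSV HEADER;
```

Note that server-side COPY to or from a file requires superuser rights; psql's \copy is the client-side alternative for ordinary users.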