why is pg_dump so much smaller than my database?

Started by Carson Gross · about 14 years ago · 4 messages · general
#1Carson Gross
carsongross@gmail.com

I've got a pretty big database (~30 gigs) and when I do a pg_dump, it ends
up only being 2 gigs.

The database consists mainly of one very large table (w/ a few varchar
columns) which, according to pg_relation_size() is 10 gigs
and pg_total_relation_size() is 26 gigs (we need to drop some indexes
there.)
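(For reference, those numbers came from queries along these lines; the table name here is just a placeholder, and pg_size_pretty() only formats the byte counts:)

```sql
-- big_table is a placeholder name
SELECT pg_size_pretty(pg_relation_size('big_table'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('big_table')) AS with_indexes_and_toast;
```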

I'm just trying to get my head around the pg_dump being an order of
magnitude smaller than the darned database itself. I would have thought the
db would offer more efficient encoding for a lot of stuff vs. an ascii file.

Thanks,
Carson

#2John R Pierce
pierce@hogranch.com
In reply to: Carson Gross (#1)
Re: why is pg_dump so much smaller than my database?

On 03/28/12 10:32 PM, Carson Gross wrote:

I've got a pretty big database (~30 gigs) and when I do a pg_dump, it
ends up only being 2 gigs.

The database consists mainly of one very large table (w/ a few varchar
columns) which, according to pg_relation_size() is 10 gigs
and pg_total_relation_size() is 26 gigs (we need to drop some indexes
there.)

I'm just trying to get my head around the pg_dump being an order of
magnitude smaller than the darned database itself. I would have thought
that the db would offer more efficient encoding for a lot of stuff vs. an
ascii file.

It's quite possible your table has a lot of free tuples scattered through
it as a result of updates or deletes. VACUUM makes these available for
reuse but does NOT free the disk space. Likewise, your indexes might be
very bloated; a REINDEX may shrink them significantly.

If you can afford some application downtime, you may consider running
CLUSTER on that table. It copies all the live tuples of the table
to new file space, frees the old, and also does the reindex
operation. I would vacuum the table first, after ensuring there aren't
any old open transactions ('IDLE IN TRANSACTION' status in
pg_stat_activity). Note that CLUSTER takes an exclusive lock on the
table, which is why I said you need some application downtime.

You don't say what version you're running; older versions had more
problems with index bloat than newer ones.
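A rough sketch of that sequence, with placeholder names throughout (the
index passed to CLUSTER is whichever one you want the table ordered by):

```sql
-- big_table / big_table_pkey are placeholders; CLUSTER takes an
-- ACCESS EXCLUSIVE lock, so run this in a maintenance window
VACUUM VERBOSE big_table;                -- make dead tuples reusable (does not shrink the file)
CLUSTER big_table USING big_table_pkey;  -- rewrite live tuples into fresh files; also rebuilds indexes
-- or, to shrink only the indexes without rewriting the table:
REINDEX TABLE big_table;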

--
john r pierce N 37, W 122
santa cruz ca mid-left coast

#3Alban Hertroys
haramrae@gmail.com
In reply to: John R Pierce (#2)
Re: why is pg_dump so much smaller than my database?

On 29 March 2012 09:11, John R Pierce <pierce@hogranch.com> wrote:

On 03/28/12 10:32 PM, Carson Gross wrote:

I've got a pretty big database (~30 gigs) and when I do a pg_dump, it ends
up only being 2 gigs.

I suppose you're talking about a plain text dump here? A compressed
dump would likely[*] be quite a bit smaller than your database.

And as John already says, deleted tuples do not get dumped, neither do
index contents. Any fill-factors defined on tables or indexes would
also account for a difference in size.

*: This does depend on what data you store, of course. A database that
mainly consists of stored JPEG images, for example, wouldn't compress
much. A database with the same text data over and over would compress
immensely.
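To illustrate the footnote with a crude stand-in (gzip here standing in
for whatever compression the dump went through): repetitive text shrinks
enormously, while random bytes barely compress at all.

```shell
# build 1 MB of repetitive text and 1 MB of incompressible random bytes
yes 'the same text data over and over' | head -c 1000000 > repetitive.txt
head -c 1000000 /dev/urandom > random.bin
gzip -k repetitive.txt random.bin
ls -l repetitive.txt.gz random.bin.gz   # the first is tiny, the second barely smaller
```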
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.

#4Carson Gross
carsongross@gmail.com
In reply to: John R Pierce (#2)
Re: why is pg_dump so much smaller than my database?

Interesting. Is there a perf hit to having a big file on disk? My
understanding is that the primary thing that really matters is keeping your
active set in memory.

This is on Postgres 9.0.x, running on Heroku/ec2.

We do have extremely compressible data, so it may be that the dump is
compressed; I'm downloading it now to check.
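One quick way to check, sketched here with a stand-in file (substitute
the real dump you downloaded): a plain-text dump starts with SQL comment
dashes, gzip output starts with the magic bytes 1f 8b, and pg_dump's
custom format starts with the string "PGDMP".

```shell
# plain.sql is a stand-in for the downloaded dump
printf -- '-- PostgreSQL database dump\n' > plain.sql
gzip -k plain.sql
head -c 2 plain.sql                       # "--"   -> plain text
od -An -tx1 -N2 plain.sql.gz | tr -d ' '  # "1f8b" -> gzip-compressed
```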

Thanks for the replies,
Carson


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general