Determining size of a database before dumping

Started by Madison Kelly over 19 years ago, 10 messages, general
#1Madison Kelly
linux@alteeve.com

Hi all,

I am (re)writing a backup program and I want to add a section for
backing up pSQL DBs. In the planning steps (making sure a given
destination has enough space) I try to calculate how much space will be
needed by a 'pg_dump' run *before* actually dumping it.

Is there a relatively easy way to do that? More so, is it possible to
do this from an unprivileged account? If not, is there a way to grant a
specific pg user the permissions needed to perform this?

Thanks in advance!

Madison

#2Steve Wampler
swampler@noao.edu
In reply to: Madison Kelly (#1)
Re: Determining size of a database before dumping

Madison Kelly wrote:

Hi all,

I am (re)writing a backup program and I want to add a section for
backing up pSQL DBs. In the planning steps (making sure a given
destination has enough space) I try to calculate how much space will be
needed by a 'pg_dump' run *before* actually dumping it.

I suppose:

pg_dump $PGD_OPTIONS | wc -c

isn't efficient enough, right? Without knowing the options you plan
to use with pg_dump (compression? dumping just some tables? etc.) it's
going to be hard to come up with a decent estimate...

--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

#3Alexander Staubo
alex@purefiction.net
In reply to: Madison Kelly (#1)
Re: Determining size of a database before dumping

On Oct 2, 2006, at 22:17 , Madison Kelly wrote:

I am (re)writing a backup program and I want to add a section for
backing up pSQL DBs. In the planning steps (making sure a given
destination has enough space) I try to calculate how much space
will be needed by a 'pg_dump' run *before* actually dumping it.

Is there a relatively easy way to do that? More so, is it possible
to do this from an unprivileged account? If not, is there a way to
grant a specific pg user the permissions needed to perform this?

You could dump the database to /dev/null, piping it through wc to
catch the size, but that would of course be wasteful.

You could count the disk space usage of the actual stored tuples,
though this will necessarily be inexact:

http://www.postgresql.org/docs/8.1/static/diskusage.html
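
For example, if your server has the 8.1 size functions, something along
these lines should give a rough lower bound straight from psql (untested
here, and "mydb" is just a placeholder for your database name):

# size of the whole database as the server sees it ("mydb" is a placeholder)
psql -At -d mydb -c "SELECT pg_size_pretty(pg_database_size(current_database()));"

# the same figure broken down per table, largest first
psql -At -d mydb -c "SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) FROM pg_class WHERE relkind = 'r' ORDER BY pg_total_relation_size(oid) DESC;"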

Or you could count the size of the physical database files (/var/lib/
postgresql or wherever). While these would be estimates, you could at
least guarantee that the dump would not *exceed* the estimate.
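
If you go the filesystem route, plain du over the data directory gives
the on-disk figure; the paths below assume a default layout with $PGDATA
pointing at the cluster directory:

# whole cluster (needs read access to $PGDATA)
du -sh "$PGDATA"

# per-database directories under base/, named by database OID
du -sh "$PGDATA"/base/*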

Keep in mind that pg_dump can compress the dump and (iirc) will do so
by default when you use the custom format (-Fc or --format=c).
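
For example ("mydb" and the output file names are just placeholders):

pg_dump -Fc mydb > mydb.dump       # custom format, compressed by default
pg_dump -Fc -Z0 mydb > mydb.dump   # custom format with compression off
pg_dump mydb > mydb.sql            # plain SQL text, no compression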

Alexander.

#4Madison Kelly
linux@alteeve.com
In reply to: Steve Wampler (#2)
Re: Determining size of a database before dumping

Steve Wampler wrote:

Madison Kelly wrote:

Hi all,

I am (re)writing a backup program and I want to add a section for
backing up pSQL DBs. In the planning steps (making sure a given
destination has enough space) I try to calculate how much space will be
needed by a 'pg_dump' run *before* actually dumping it.

I suppose:

pg_dump $PGD_OPTIONS | wc -c

isn't efficient enough, right? Without knowing the options you plan
to use with pg_dump (compression? dumping just some tables? etc.) it's
going to be hard to come up with a decent estimate...

For now, let's assume I am doing a raw dump (no compression) and no fancy
switches. I would probably err on the side of caution and try dumping
OIDs and the full schema (plus whatever else is needed to ensure a full
restore to a clean DB).

I could try piping the dump into something like 'wc' but with very large
DBs I'd be worried about the (tremendous) disk I/O that would cause.
This is also why I am hoping Pg keeps this info somewhere.

Madison

#5Madison Kelly
linux@alteeve.com
In reply to: Alexander Staubo (#3)
Re: Determining size of a database before dumping

Alexander Staubo wrote:

On Oct 2, 2006, at 22:17 , Madison Kelly wrote:

I am (re)writing a backup program and I want to add a section for
backing up pSQL DBs. In the planning steps (making sure a given
destination has enough space) I try to calculate how much space will
be needed by a 'pg_dump' run *before* actually dumping it.

Is there a relatively easy way to do that? More so, is it possible to
do this from an unprivileged account? If not, is there a way to grant a
specific pg user the permissions needed to perform this?

You could dump the database to /dev/null, piping it through wc to catch
the size, but that would of course be wasteful.

You could count the disk space usage of the actual stored tuples, though
this will necessarily be inexact:

http://www.postgresql.org/docs/8.1/static/diskusage.html

Or you could count the size of the physical database files
(/var/lib/postgresql or wherever). While these would be estimates, you
could at least guarantee that the dump would not *exceed* the estimate.

Keep in mind that pg_dump can compress the dump and (iirc) will do so by
default when you use the custom format (-Fc or --format=c).

Alexander.

Heh, that looks like just the article I would have wanted if I had
properly RTFM. :D

Many thanks!!

Madi

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Staubo (#3)
Re: Determining size of a database before dumping

Alexander Staubo <alex@purefiction.net> writes:

You could count the disk space usage of the actual stored tuples,
though this will necessarily be inexact:
http://www.postgresql.org/docs/8.1/static/diskusage.html
Or you could count the size of the physical database files (/var/lib/
postgresql or wherever). While these would be estimates, you could at
least guarantee that the dump would not *exceed* the estimate.

You could guarantee no such thing; consider compression of TOAST values.
Even for uncompressed data, datatypes such as int and float can easily
print as more bytes than they occupy on-disk.
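
A quick way to see that from psql (assuming the server is new enough to
have pg_column_size()): a four-byte int4 already prints as ten
characters, before counting any delimiters or a sign:

psql -c "SELECT pg_column_size(2147483647) AS bytes_on_disk, length(2147483647::text) AS chars_in_text_dump;"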

Given all the non-data overhead involved (e.g., for indexes), it's probably
unlikely that a text dump would exceed the "du" size of the database,
but it's far from "guaranteed".

regards, tom lane

#7Alexander Staubo
alex@purefiction.net
In reply to: Tom Lane (#6)
Re: Determining size of a database before dumping

On Oct 2, 2006, at 23:19 , Tom Lane wrote:

Alexander Staubo <alex@purefiction.net> writes:

You could count the disk space usage of the actual stored tuples,
though this will necessarily be inexact:
http://www.postgresql.org/docs/8.1/static/diskusage.html
Or you could count the size of the physical database files (/var/lib/
postgresql or wherever). While these would be estimates, you could at
least guarantee that the dump would not *exceed* the estimate.

You could guarantee no such thing; consider compression of TOAST
values.
Even for uncompressed data, datatypes such as int and float can easily
print as more bytes than they occupy on-disk.

Why does pg_dump serialize data less efficiently than PostgreSQL when
using the "custom" format? (Pg_dump arguably has greater freedom in
being able to apply space-saving optimizations to the output format.
For example, one could use table statistics to selectively apply
something like Rice coding for numeric data, or vertically decompose
the tuples and emit sorted vectors using delta compression.) As for
TOAST, should not pg_dump's compression compress just as well, or
better?

Alexander.

#8Ron Johnson
ron.l.johnson@cox.net
In reply to: Tom Lane (#6)
Re: Determining size of a database before dumping


On 10/02/06 16:19, Tom Lane wrote:

Alexander Staubo <alex@purefiction.net> writes:

You could count the disk space usage of the actual stored tuples,
though this will necessarily be inexact:
http://www.postgresql.org/docs/8.1/static/diskusage.html
Or you could count the size of the physical database files (/var/lib/
postgresql or wherever). While these would be estimates, you could at
least guarantee that the dump would not *exceed* the estimate.

You could guarantee no such thing; consider compression of TOAST values.
Even for uncompressed data, datatypes such as int and float can easily
print as more bytes than they occupy on-disk.

Given all the non-data overhead involved (e.g., for indexes), it's probably
unlikely that a text dump would exceed the "du" size of the database,
but it's far from "guaranteed".

It's my experience that when there are lots of numeric fields,
fixed-width text records are approx 2.5x larger than the original
binary records.

--
Ron Johnson, Jr.
Jefferson LA USA

#9Jeff Davis
pgsql@j-davis.com
In reply to: Alexander Staubo (#7)
Re: Determining size of a database before dumping

On Tue, 2006-10-03 at 00:42 +0200, Alexander Staubo wrote:

Why does pg_dump serialize data less efficiently than PostgreSQL when
using the "custom" format? (Pg_dump arguably has greater freedom in
being able to apply space-saving optimizations to the output format.
For example, one could use table statistics to selectively apply
something like Rice coding for numeric data, or vertically decompose
the tuples and emit sorted vectors using delta compression.) As for
TOAST, should not pg_dump's compression compress just as well, or
better?

It would be a strange set of data that had a larger representation as a
compressed pg_dump than the data directory itself. However, one could
imagine a contrived case where that might happen.

Let's say you had a single table with 10,000 columns of type INT4, 100M
records, all with random numbers in the columns. I don't think standard
gzip compression will compress random INT4s down to 32 bits.
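
A scaled-down version of that experiment is easy to run against a
scratch database (the name "scratch" and the row count are just
placeholders):

# one million random int4 values
psql -d scratch -c "CREATE TABLE rand_demo AS SELECT (random()*2000000000)::int4 AS n FROM generate_series(1,1000000);"

# on-disk size of the table, in bytes
psql -At -d scratch -c "SELECT pg_total_relation_size('rand_demo');"

# size of a compressed dump of the same table, in bytes
pg_dump -t rand_demo scratch | gzip | wc -c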

Another example is NULLs. What if only a few of those records had non-
NULL values? If I understand correctly, PostgreSQL will represent those
NULLs with just one bit.

What you're saying is more theoretical. If pg_dump used specialized
compression based on the data type of the columns, and everything was
optimal, you're correct: there's no situation in which the dump *must*
be bigger. However, since there is no practical demand for such
compression, and it would be a lot of work, there is no *guarantee* that
the data directory will be bigger than the dump. In practice, though, it
usually is.

Regards,
Jeff Davis

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#9)
Re: Determining size of a database before dumping

Jeff Davis <pgsql@j-davis.com> writes:

On Tue, 2006-10-03 at 00:42 +0200, Alexander Staubo wrote:

Why does pg_dump serialize data less efficiently than PostgreSQL when
using the "custom" format?

What you're saying is more theoretical. If pg_dump used specialized
compression based on the data type of the columns, and everything was
optimal, you're correct. There's no situation in which the dump *must*
be bigger. However, since there is no practical demand for such
compression, and it would be a lot of work ...

There are several reasons for not being overly tense about the pg_dump
format:

* We don't have infinite manpower

* Cross-version and cross-platform portability of the dump files is
critical

* The more complicated it is, the more chance for bugs, which you'd
possibly not notice until you *really needed* that dump.

In practice, pushing the data through gzip gets most of the potential
win, for a very small fraction of the effort it would take to have a
smart custom compression mechanism.
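
For the original question in this thread, that boils down to something
like ("mydb" is a placeholder):

# plain SQL dump, compressed on the way out
pg_dump mydb | gzip > mydb.sql.gz

# or just measure the compressed size without keeping the output
pg_dump mydb | gzip | wc -c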

regards, tom lane