a back up question

Started by Martin Mueller · over 8 years ago · 15 messages · general
#1 Martin Mueller
martinmueller@northwestern.edu

Are there rules of thumb for deciding when you can dump a whole database and when you’d be better off dumping groups of tables? I have a database that has around 100 tables, some of them quite large, and right now the data directory is well over 100GB. My hunch is that I should divide and conquer, but I don’t have a clear sense of what counts as “too big” these days. Nor do I have a clear sense of whether the constraints have to do with overall size, the number of tables, or machine memory (my machine has 32GB of memory).

Is 10GB a good practical limit to keep in mind?

#2 David G. Johnston
david.g.johnston@gmail.com
In reply to: Martin Mueller (#1)
Re: a back up question

On Tue, Dec 5, 2017 at 2:52 PM, Martin Mueller <
martinmueller@northwestern.edu> wrote:

Are there rules of thumb for deciding when you can dump a whole database
and when you’d be better off dumping groups of tables? I have a database
that has around 100 tables, some of them quite large, and right now the
data directory is well over 100GB. My hunch is that I should divide and
conquer, but I don’t have a clear sense of what counts as “too big” these
days. Nor do I have a clear sense of whether the constraints have to do
with overall size, the number of tables, or machine memory (my machine has
32GB of memory).

Is 10GB a good practical limit to keep in mind?

I'd say the rule of thumb is: if you have to "divide and conquer", you
should use a non-pg_dump-based backup solution. "Too big" is usually measured
in units of time, not memory.

Any ability to partition your backups into discrete chunks is going to be
very specific to your personal setup. Restoring such a monster without
constraint violations is something I'd be VERY worried about.

David J.

#3 Carl Karsten
carl@personnelware.com
In reply to: Martin Mueller (#1)
Re: a back up question

Nothing wrong with lots of tables and data.

Don't impose any constraints on your problem you don't need to.

Like what are you backing up to? $400 for a 1TB SSD or $80 for a 2TB USB3
spinny disk.

If you are backing up while the db is being updated, you need to make sure
updates are queued until the backup is done. Don't mess with that
process. Personally I would assume the db is always being updated and
expect that.

On Tue, Dec 5, 2017 at 3:52 PM, Martin Mueller <
martinmueller@northwestern.edu> wrote:

Are there rules of thumb for deciding when you can dump a whole database
and when you’d be better off dumping groups of tables? I have a database
that has around 100 tables, some of them quite large, and right now the
data directory is well over 100GB. My hunch is that I should divide and
conquer, but I don’t have a clear sense of what counts as “too big” these
days. Nor do I have a clear sense of whether the constraints have to do
with overall size, the number of tables, or machine memory (my machine has
32GB of memory).

Is 10GB a good practical limit to keep in mind?

--
Carl K

#4 Martin Mueller
martinmueller@northwestern.edu
In reply to: David G. Johnston (#2)
Re: a back up question

Time is not really a problem for me, if we’re talking about hours rather than days. On a roughly comparable machine I’ve made backups of databases of less than 10 GB, and it was a matter of minutes. But I know that there are scale problems. Sometimes programs just hang if the data are beyond some size. Is that likely in Postgres if you go from ~10 GB to ~100 GB? There isn’t any interdependence among my tables beyond queries I construct on the fly, because I use the database in a single-user environment.

From: "David G. Johnston" <david.g.johnston@gmail.com>
Date: Tuesday, December 5, 2017 at 3:59 PM
To: Martin Mueller <martinmueller@northwestern.edu>
Cc: "pgsql-general@lists.postgresql.org" <pgsql-general@lists.postgresql.org>
Subject: Re: a back up question

On Tue, Dec 5, 2017 at 2:52 PM, Martin Mueller <martinmueller@northwestern.edu> wrote:
Are there rules of thumb for deciding when you can dump a whole database and when you’d be better off dumping groups of tables? I have a database that has around 100 tables, some of them quite large, and right now the data directory is well over 100GB. My hunch is that I should divide and conquer, but I don’t have a clear sense of what counts as “too big” these days. Nor do I have a clear sense of whether the constraints have to do with overall size, the number of tables, or machine memory (my machine has 32GB of memory).

Is 10GB a good practical limit to keep in mind?

I'd say the rule of thumb is: if you have to "divide and conquer", you should use a non-pg_dump-based backup solution. "Too big" is usually measured in units of time, not memory.

Any ability to partition your backups into discrete chunks is going to be very specific to your personal setup. Restoring such a monster without constraint violations is something I'd be VERY worried about.

David J.

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Carl Karsten (#3)
Re: a back up question

Carl Karsten wrote:

Nothing wrong with lots of tables and data.

Don't impose any constraints on your problem you don't need to.

Like what are you backing up to? $400 for a 1TB SSD or $80 for a 2TB USB3
spinny disk.

If you are backing up while the db is being updated, you need to make sure
updates are queued until the backup is done. don't mess with that
process. personally I would assume the db is always being updated and
expect that.

A backup generated by pg_dump never includes writes that are in flight
while the backup is being taken. That would make the backup absolutely
worthless!

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 David G. Johnston
david.g.johnston@gmail.com
In reply to: Martin Mueller (#4)
Re: a back up question

On Tue, Dec 5, 2017 at 3:09 PM, Martin Mueller <
martinmueller@northwestern.edu> wrote:

Time is not really a problem for me, if we talk about hours rather than
days. On a roughly comparable machine I’ve made backups of databases less
than 10 GB, and it was a matter of minutes. But I know that there are
scale problems. Sometimes programs just hang if the data are beyond some
size. Is that likely in Postgres if you go from ~ 10 GB to ~100 GB? There
isn’t any interdependence among my tables beyond queries I construct on
the fly, because I use the database in a single user environment

The convention on these lists is to inline and/or bottom-post; please
avoid top-posting.

That you are using a relational database system to house tables without any
interdependence (relationships) between them is an interesting
proposition. That you are in a "single user environment" in most cases
would have no impact on this...

PostgreSQL itself, bugs notwithstanding, won't "hang" no matter how much
data is being processed. It does, however, take out locks so that the
entire dump represents the exact same snapshot for all dumped objects.
Those locks can impact queries. In particular, using "TRUNCATE" becomes
pretty much impossible while a dump backup is in progress (I get bit by
this, since I tend to truncate unlogged tables quite a bit in my usage of
PostgreSQL). Normal updates and selects usually work without problem,
though any transaction started after the backup begins will not be part of
the output, even if it commits long before the backup finishes.

I suspect that typically you will end up annoyed at how long the backup
takes well before any program/system issues become apparent. Data is
streamed to the output file handle so active memory usage and database size
are not really correlated.
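
For illustration only (the database and file names here are placeholders,
not anything from this thread): because the data is streamed, even a
plain-format dump can be piped straight into a compressor without the whole
database ever being held in memory, e.g.

    pg_dump mydb | gzip > mydb.sql.gz

The whole run uses a single snapshot, so the output stays consistent even
while other sessions keep writing.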

David J.

#7 Carl Karsten
carl@personnelware.com
In reply to: Alvaro Herrera (#5)
Re: a back up question

On Tue, Dec 5, 2017 at 4:15 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Carl Karsten wrote:

Nothing wrong with lots of tables and data.

Don't impose any constraints on your problem you don't need to.

Like what are you backing up to? $400 for a 1TB SSD or $80 for a 2TB USB3
spinny disk.

If you are backing up while the db is being updated, you need to make sure
updates are queued until the backup is done. don't mess with that
process. personally I would assume the db is always being updated and
expect that.

A backup generated by pg_dump never includes writes that are in flight
while the backup is being taken. That would make the backup absolutely
worthless!

Hmm, I kinda glossed over my point:
if you come up with your own process to chop up the backup into little
pieces, you risk letting writes in, and then yeah, worthless.

--
Carl K

#8 John R Pierce
pierce@hogranch.com
In reply to: Martin Mueller (#4)
Re: a back up question

On 12/5/2017 2:09 PM, Martin Mueller wrote:

Time is not really a problem for me, if we talk about hours rather
than days.  On a roughly comparable machine I’ve made backups of
databases less than 10 GB, and it was a matter of minutes.  But I know
that there are scale problems. Sometimes programs just hang if the
data are beyond some size.  Is that likely in Postgres if you go from
~ 10 GB to ~100 GB?  There isn’t any interdependence among my tables
beyond  queries I construct on the fly, because I use the database in
a single user environment

Another factor is restore time. Restores have to create indexes.
Creating indexes on multi-million-row tables can take a while. (Hint: be
sure to set maintenance_work_mem to 1GB before doing this!)
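
As a rough sketch of that (the 1GB value and the "mydb" database/dump names
are just placeholders for your own setup), you can raise it for the restore
session alone via libpq's PGOPTIONS environment variable instead of editing
postgresql.conf:

    PGOPTIONS='-c maintenance_work_mem=1GB' pg_restore -d mydb mydb.dump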

--
john r pierce, recycling bits in santa cruz

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Carl Karsten (#7)
Re: a back up question

Carl Karsten wrote:

On Tue, Dec 5, 2017 at 4:15 PM, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

A backup generated by pg_dump never includes writes that are in flight
while the backup is being taken. That would make the backup absolutely
worthless!

Hmm, I kinda glossed over my point:
if you come up with your own process to chop up the backup into little
pieces, you risk letting writes in, and then yeah, worthless.

Ah, sure.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Martin Mueller (#1)
Re: a back up question

On Tue, Dec 05, 2017 at 09:52:28PM +0000, Martin Mueller wrote:

Are there rules of thumb for deciding when you can dump a
whole database and when you’d be better off dumping groups of
tables?

It seems to me we'd have to define the objective of "dumping" first?

Regards,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#11 Martin Mueller
martinmueller@northwestern.edu
In reply to: Karsten Hilbert (#10)
Re: a back up question

On 12/6/17, 4:39 AM, "karsten.hilbert@gmx.net" <karsten.hilbert@gmx.net> wrote:

On Tue, Dec 05, 2017 at 09:52:28PM +0000, Martin Mueller wrote:

Are there rules of thumb for deciding when you can dump a
whole database and when you’d be better off dumping groups of
tables?

It seems to me we'd have to define the objective of "dumping" first?

Regards,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

The objective is to create a backup from which I can restore any or all tables in the event of a crash. In my case, I use Postgres for my own scholarly purposes. Publications of whatever kind are not directly made public via the database. I am my only customer, and a service interruption, while a nuisance to me, does not create a crisis for others. I don’t want to lose my work, but a service interruption of a day or a week is no big deal.

#12 Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Martin Mueller (#11)
Re: a back up question

On Wed, Dec 06, 2017 at 12:52:53PM +0000, Martin Mueller wrote:

Are there rules of thumb for deciding when you can dump a
whole database and when you’d be better off dumping groups of
tables?

It seems to me we'd have to define the objective of "dumping" first?

The objective is to create a backup from which I can
restore any or all tables in the event of a crash.

I see.

"Any or all" speaks in recommendation of non-plain output
formats _if_ using pg_dump.

In my case, I use Postgres for my own scholarly purposes.
Publications of whatever kind are not directly made public
via the database. I am my only customer, and a service
interruption, while a nuisance to me, does not create a
crisis for others. I don’t want to lose my work, but a
service interruption of a day or a week is no big deal.

In that case I would stick to pg_dump, perhaps with directory
format and then tarred and compressed, until you notice
actual problems (unbearable slowdown of the machine during
backup, running out of disk space).
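
Something along these lines, as a sketch (database name, output directory,
and job count are placeholders; the -j parallel option only works with the
directory format):

    pg_dump -Fd -j 4 -f mydb_dump mydb
    tar czf mydb_dump.tar.gz mydb_dump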

My 2 cents,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#13 Stephen Frost
sfrost@snowman.net
In reply to: John R Pierce (#8)
Re: a back up question

John, all,

* John R Pierce (pierce@hogranch.com) wrote:

On 12/5/2017 2:09 PM, Martin Mueller wrote:

Time is not really a problem for me, if we talk about hours rather
than days.  On a roughly comparable machine I’ve made backups of
databases less than 10 GB, and it was a matter of minutes.  But I
know that there are scale problems. Sometimes programs just hang
if the data are beyond some size.  Is that likely in Postgres if
you go from ~ 10 GB to ~100 GB?  There isn’t any interdependence
among my tables beyond  queries I construct on the fly, because I
use the database in a single user environment

another factor is restore time.    restores have to create
indexes.   creating indexes on multi-million-row tables can take
awhile.  (hint, be sure to set maintenance_work_mem to 1GB before
doing this!)

I'm sure you're aware of this John, but for others following along, just
to be clear: indexes have to be recreated when restoring from a
*logical* (e.g. pg_dump based) backup. Indexes don't have to be
recreated for *physical* (e.g. file-based) backups.

Neither pg_dump nor the various physical-backup utilities should hang or
have issues with larger data sets.

Thanks!

Stephen

#14 Vick Khera
vivek@khera.org
In reply to: Martin Mueller (#11)
Re: a back up question

On Wed, Dec 6, 2017 at 7:52 AM, Martin Mueller <
martinmueller@northwestern.edu> wrote:

The objective is to create a backup from which I can restore any or all
tables in the event of a crash. In my case, I use Postgres for my own
scholarly purposes. Publications of whatever kind are not directly made
public via the database. I am my only customer, and a service interruption,
while a nuisance to me, does not create a crisis for others. I don’t want
to lose my work, but a service interruption of a day or a week is no big
deal.

I'd stick with pg_dump for sure. Two main choices, depending on how big your
database is and how fast your disks are: 1) "c" format into a single flat
compressed file from which you can restore; 2) "d" format, which you would
then need to compress and tar for easy tracking and off-site
copying. The only real advantage to the "d" format is that you can parallelize
the dump if you have enough spare I/O bandwidth.
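
As a minimal sketch of option 1 (all names are placeholders; the custom
format is compressed by default when PostgreSQL is built with compression
support, and it lets you pull out a single table later):

    pg_dump -Fc -f nightly.dump mydb
    pg_restore -d mydb -t some_table nightly.dump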

For my backups on a production database serving thousands of customers per
day (mostly in the US) on a web app, I just did a "c" format pg_dump
nightly around 3am US Eastern time. It was our low time, and the impact on
the database server was not significant since it had more RAM than the size
of the database on disk (256GB RAM vs 100GB disk size including indexes).
The backups were made on a different machine which connected via LAN to the DB
server and wrote to its own local disk, then copied that to an off-site
server. Before I had such beefy hardware, I would do the dump from a
replica which was updated using the Slony1 software. The pg_dump backups were
for disaster recovery and customer error recovery, so I kept about 2 weeks'
worth of them.

Since you have no other consumers of your data, just use a simple "c"
format dump however often you like, then copy those off-site. Easy peasy.

#15 Magnus Hagander
magnus@hagander.net
In reply to: Martin Mueller (#11)
Re: a back up question

On Wed, Dec 6, 2017 at 9:52 PM, Martin Mueller <
martinmueller@northwestern.edu> wrote:

On 12/6/17, 4:39 AM, "karsten.hilbert@gmx.net" <karsten.hilbert@gmx.net>
wrote:

On Tue, Dec 05, 2017 at 09:52:28PM +0000, Martin Mueller wrote:

Are there rules of thumb for deciding when you can dump a
whole database and when you’d be better off dumping groups of
tables?

It seems to me we'd have to define the objective of "dumping" first?

Regards,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

The objective is to create a backup from which I can restore any or all
tables in the event of a crash. In my case, I use Postgres for my own
scholarly purposes. Publications of whatever kind are not directly made
public via the database. I am my only customer, and a service interruption,
while a nuisance to me, does not create a crisis for others. I don’t want
to lose my work, but a service interruption of a day or a week is no big
deal.

If you reach the point where you have to consider splitting up the dumps
for performance reasons, then you have really reached the point where
pg_dump just isn't good enough for backups anymore. There are good uses for
pg_dump even on such large databases, but backups aren't one of them.

You should then instead use pg_basebackup, or if you need even more
functionality and performance than it provides, look at external
tools like pgbackrest or pgbarman. These tools don't need to be any more
complicated to use than pg_dump, but will give you a much better backup.
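
For reference, a bare-bones pg_basebackup run might look something like this
(the target directory is a placeholder, connection options are omitted, and
it needs a role with replication privileges):

    pg_basebackup -D /backups/base_20171206 -Ft -z -P

pgbackrest and pgbarman add things like incremental backups, retention
policies, and point-in-time recovery on top of that.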

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/