incremental dumps

Started by Noname · over 12 years ago · 11 messages · general
#1Noname
hamann.w@t-online.de

Hi,
I want to store copies of our data on a remote machine as a security measure.
My first attempt was a full dump (which takes too long to upload)
followed by diffs between the pgdump files.
This provides readable / searchable versioned data (I could always apply
the diffs on the remote machine and search the text file, without having
an instance of postgres running on that machine)

However, the diff files seem to be considerably larger than one would expect.
One obvious part of the problem is the fact that diff shows old and new text,
so e.g. changing the amount of stock for a product with a 1kB description
would generate at least 2kb of text in the diff file.

What would be a better approach?

Best regards
Wolfgang

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Luca Ferrari
fluca1978@infinito.it
In reply to: Noname (#1)
Re: incremental dumps

On Thu, Aug 1, 2013 at 10:59 AM, <hamann.w@t-online.de> wrote:

However, the diff files seem to be considerably larger than one would expect.
One obvious part of the problem is the fact that diff shows old and new text,
so e.g. changing the amount of stock for a product with a 1kB description
would generate at least 2kb of text in the diff file.

What would be a better approach?

I suppose WAL archiving or PITR would be better, but assuming you want
text files I guess you need to change your database structure to
either:
1) include a watermark on the data and dump only data since the last dump
(done manually);
2) partition your tables and backup specific tables/partitions
depending on the timing.
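As a rough shell sketch of option 1 — every name here (database, table, column, watermark file) is an assumption for illustration, not something from the thread:

```shell
#!/bin/sh
# Sketch: dump only rows changed since the last run, tracked via a watermark file.
# Assumes a database "mydb" and a table "products" with a "last_modified" column.
WATERMARK=last_dump.watermark
SINCE=$(cat "$WATERMARK" 2>/dev/null || echo '1970-01-01 00:00:00')
NOW=$(date '+%Y-%m-%d %H:%M:%S')

psql mydb -c "COPY (SELECT * FROM products
                    WHERE last_modified > '$SINCE'
                    ORDER BY id) TO STDOUT" > "products_incr_$(date +%Y%m%d).txt" \
  && echo "$NOW" > "$WATERMARK"   # advance the watermark only if the dump succeeded
```

The ORDER BY keeps successive incremental files diff-friendly; note this scheme never sees deleted rows.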

Hope this helps.
Luca

#3Bèrto ëd Sèra
berto.d.sera@gmail.com
In reply to: Luca Ferrari (#2)
Re: incremental dumps

I suppose WAL archiving or PITR would be better

+1, never re-invent the wheel, unless you really need to.
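For reference, basic WAL archiving amounts to a few postgresql.conf settings; this is only a sketch, and the rsync destination is an assumption, not something from the thread:

```
# postgresql.conf (sketch -- the archive path and remote target are assumptions)
wal_level = archive        # "replica" on later PostgreSQL releases
archive_mode = on
archive_command = 'rsync -a %p backup@remote:/wal_archive/%f'
```

A base backup plus these archived segments is what PITR replays; the PostgreSQL continuous-archiving documentation covers the full procedure.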

Bèrto

On 1 August 2013 14:14, Luca Ferrari <fluca1978@infinito.it> wrote:

On Thu, Aug 1, 2013 at 10:59 AM, <hamann.w@t-online.de> wrote:

However, the diff files seem to be considerably larger than one would expect.
One obvious part of the problem is the fact that diff shows old and new text,
so e.g. changing the amount of stock for a product with a 1kB description
would generate at least 2kb of text in the diff file.

What would be a better approach?

I suppose WAL archiving or PITR would be better, but assuming you want
text files I guess you need to change your database structure to
either:
1) include a watermark on the data and dump only data since the last dump
(done manually);
2) partition your tables and backup specific tables/partitions
depending on the timing.

Hope this helps.
Luca

--
==============================
If Pac-Man had affected us as kids, we'd all be running around in a
darkened room munching pills and listening to repetitive music.

#4Martin Collins
martin@mkcollins.org
In reply to: Noname (#1)
Re: incremental dumps

On 08/01/2013 02:59 AM, hamann.w@t-online.de wrote:

However, the diff files seem to be considerably larger than one would expect.
One obvious part of the problem is the fact that diff shows old and new text,

You could try using
diff --suppress-common-lines -ed
which in my experience creates the smallest diffs (actually ed scripts).
Apply them with
patch -e
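A quick round-trip check of that workflow (the file names are just for illustration):

```shell
# Make two versions of a "dump", produce an ed-script diff, and re-apply it.
printf 'id|stock\n1|5\n2|7\n' > dump_old.txt
printf 'id|stock\n1|5\n2|9\n' > dump_new.txt

# diff exits 1 when the files differ, so don't let that abort a set -e script.
diff -e dump_old.txt dump_new.txt > changes.ed || true   # ed script: changed lines only
patch -e dump_old.txt < changes.ed                       # bring the old copy up to date

cmp -s dump_old.txt dump_new.txt && echo 'round trip OK'
```

Because the ed script carries only the replacement text (no context, no old lines), it stays close to the minimum size for line-oriented diffs.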

Martin

#5Noname
hamann.w@t-online.de
In reply to: Martin Collins (#4)
Re: incremental dumps

On 08/01/2013 02:59 AM, hamann.w@t-online.de wrote:

However, the diff files seem to be considerably larger than one would expect.
One obvious part of the problem is the fact that diff shows old and new text,

You could try using
diff --suppress-common-lines -ed
which in my experience creates the smallest diffs (actually ed scripts).
Apply them with
patch -e

Martin

Hi Martin,

thanks for the hint - this is probably one of the things to do.
I have something else in mind, but at present I just suspect that this might happen:
when I modify data and select _without an ordering_, I am pretty sure to get the data
in a different sequence than before. So I wonder whether forcing the dump to honor
a particular ordering (if that is at all possible) would also reduce the size of the dumps,
or the time diff takes to produce them.
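As far as I know pg_dump itself has no switch for row ordering, but a per-table dump through psql can impose one; a sketch (database, table, and key names are assumptions):

```shell
# Sketch: dump a table in primary-key order so successive dumps diff cleanly.
psql mydb -c "COPY (SELECT * FROM products ORDER BY id) TO STDOUT" \
    > products_sorted.txt
```

With a stable ordering, a row update shows up as a one-line change instead of a delete in one place and an insert in another.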

Regards
Wolfgang Hamann

#6Luca Ferrari
fluca1978@infinito.it
In reply to: Noname (#5)
Re: incremental dumps

On Fri, Aug 2, 2013 at 6:55 PM, <hamann.w@t-online.de> wrote:

thanks for the hint - this is probably one of the things to do.
I have something else in mind, but at present I just suspect that this might happen:
when I modify data and select _without an ordering_, I am pretty sure to get the data
in a different sequence than before. So I wonder whether forcing the dump to honor
a particular ordering (if that is at all possible) would also reduce the size of the dumps,
or the time diff takes to produce them.

May I ask what the final purpose is? Because if it is to take a backup
I guess this is not the right way to go, while if it is keeping (and
rebuilding) a history of the data, then using a more specific approach
(like logging) could give you fewer headaches.

Luca

#7Noname
hamann.w@t-online.de
In reply to: Luca Ferrari (#6)
Re: incremental dumps

Luca Ferrari wrote:
On Fri, Aug 2, 2013 at 6:55 PM, <hamann.w@t-online.de> wrote:

thanks for the hint - this is probably one of the things to do.
I have something else in mind, but at present I just suspect that this might happen:
when I modify data and select _without an ordering_, I am pretty sure to get the data
in a different sequence than before. So I wonder whether forcing the dump to honor
a particular ordering (if that is at all possible) would also reduce the size of the dumps,
or the time diff takes to produce them.

May I ask what the final purpose is? Because if it is to take a backup
I guess this is not the right way to go, while if it is keeping (and
rebuilding) a history of the data, then using a more specific approach
(like logging) could give you fewer headaches.

Luca
--------------
Hi Luca,

we recently decided to have off-site backups rather than burning piles of DVDs that are kept
on-site. The backup server sits in a data center and is fed nightly via rsync.
The link is not too fast.
One thought in favor of text files: if disaster really strikes (the regular machine catches fire),
it is quite likely that a replacement would be installed with the latest versions of all software.
Now, if I had binary backups, I would probably have to install the old version of the software
just to be able to do a regular dump and then reload it into the newer one.
With the planned setup, I would be able to look up previous states of the database as well.
(Sample scenario: when was the price of product xyz actually changed?)
This is likely not too convenient ... but loading successive dumps into a secondary installation
of the database is definitely worse.

Regards
Wolfgang

#8Jeff Janes
jeff.janes@gmail.com
In reply to: Noname (#1)
Re: incremental dumps

On Thu, Aug 1, 2013 at 1:59 AM, <hamann.w@t-online.de> wrote:

Hi,
I want to store copies of our data on a remote machine as a security measure.

Can you describe what your security concerns are? Are you worried
about long-lasting malicious tampering with the data that you need to
be able to recover from? Simple loss of data from natural disaster?

My first attempt was a full dump (which takes too long to upload)
followed by diffs between the pgdump files.
This provides readable / searchable versioned data (I could always apply
the diffs on the remote machine and search the text file, without having
an instance of postgres running on that machine)

I think that optimizing with the intention of not using PostgreSQL is
probably the wrong approach. You find it valuable to use PostgreSQL
on your production server, why would you not also find it valuable to
use it on the remote? I like the ability to use pg_dump to get human
readable data, and I use it often. But I usually start with a binary
image recovered to the point I want, and then take a "fresh" pg_dump
out of that for inspection, rather than trying to save pg_dumps at
every time that might be of interest.

However, the diff files seem to be considerably larger than one would expect.
One obvious part of the problem is the fact that diff shows old and new text,
so e.g. changing the amount of stock for a product with a 1kB description
would generate at least 2kb of text in the diff file.

Usually a 1kb product description would not be in the same table as a
current stock count. Anyway, there are diff tools that are not
line-oriented which could compress well on this type of data, but if
the rows are not in the same order between dumps, they do poorly.
There is some more discussion of this here:

/messages/by-id/CAMkU=1z0+=M-2g-N3+y=d-QJu-qcYTGXbw_h1E5g5Gr+rCny6A@mail.gmail.com
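One cheap way to neutralize that row-order churn without touching the database is to sort the data lines before diffing; a sketch with made-up file contents:

```shell
# Sort two dumps' data lines so row-order churn doesn't inflate the diff.
printf '2|7\n1|5\n3|1\n' > dump_a.txt
printf '1|5\n3|1\n2|9\n' > dump_b.txt

sort dump_a.txt > dump_a.sorted
sort dump_b.txt > dump_b.sorted

# Only the genuinely changed row survives into the diff now.
# (diff exits 1 when files differ; that's expected here.)
diff dump_a.sorted dump_b.sorted || true
```

This only helps the diff storage, of course; restoring still works from the latest reconstructed dump, and COPY does not care about row order.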

Cheers,

Jeff

#9Michael Nolan
htfoot@gmail.com
In reply to: Noname (#1)
Re: incremental dumps

On 8/1/13, hamann.w@t-online.de <hamann.w@t-online.de> wrote:

Hi,
I want to store copies of our data on a remote machine as a security
measure.

Wolfgang

2 questions:

1. How secure is the remote site?
2. How much data are we talking about?
--
Mike Nolan

#10Noname
hamann.w@t-online.de
In reply to: Michael Nolan (#9)
Re: incremental dumps

On 8/1/13, hamann.w@t-online.de <hamann.w@t-online.de> wrote:

Hi,
I want to store copies of our data on a remote machine as a security
measure.

Wolfgang

2 questions:

1. How secure is the remote site?
2. How much data are we talking about?
--
Mike Nolan

Hi Mike,

currently the source uses some 20 GB in a database partition and about 700 GB
in a general data partition. For the database, a diff -e grows to about 10% of the size
of a full dump over a week.
The remote site is a RAID box at a hosting center, with paid backup.

Regards
Wolfgang

#11Michael Nolan
htfoot@gmail.com
In reply to: Noname (#10)
Re: incremental dumps

On 8/10/13, hamann.w@t-online.de <hamann.w@t-online.de> wrote:

currently the source uses some 20 GB in a database partition and about 700 GB
in a general data partition. For the database, a diff -e grows to about 10% of the size
of a full dump in a week
The remote site is a raid box at a hosting center, with paid backup

Regards
Wolfgang

It sounds like you have catastrophic failure covered, but what about
data integrity and data security?

You may need to 'roll your own' solution, possibly using something like Slony.

Having a timestamp field that indicates when the row was inserted or
last updated may help.
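Such a column is typically maintained with a BEFORE UPDATE trigger; a minimal sketch via psql, where the database, table, and trigger names are all assumptions:

```shell
# Sketch: maintain a last_updated column with a trigger ("mydb"/"products" assumed).
psql mydb <<'SQL'
ALTER TABLE products ADD COLUMN last_updated timestamptz NOT NULL DEFAULT now();

CREATE FUNCTION touch_last_updated() RETURNS trigger AS $$
BEGIN
    NEW.last_updated := now();   -- stamp every updated row with the current time
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_touch
    BEFORE UPDATE ON products
    FOR EACH ROW EXECUTE PROCEDURE touch_last_updated();
SQL
```

Note that this records inserts and updates only; deleted rows leave no trace, which is exactly the hard case for incremental backups.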

A true incremental backup would IMHO be a very useful tool for
database administrators, but there are a number of technical
challenges involved, especially dealing with deleted records.
--
Mike Nolan
