Incremental backups, and backup history
Hi,
I have two backup questions, not much related to each other; here they
are.
First: With PostgreSQL, I can't do incremental backups. pg_dump will
dump the entire database. Thus, if I want to keep daily backups on tape,
I'm in trouble because I'll have to do a full backup every day, which
may need several hours and several tapes.
One workaround I'm considering is not to store BLOBs in the database, but
to store them in the filesystem and keep only the filenames in the
database. This needs some additional work (for example, a garbage
collector to periodically delete unreferenced files), but it would move a
large amount of data from the database into the filesystem, which is
capable of incremental backups. That only covers BLOBs, though; for some
tables that will have several tens of millions of small rows, I can't
think of any workaround.
Is this filenames-instead-of-BLOBs for easier backup common practice?
Any other ideas or comments?
My second question is a general relational database backup question, not
specifically related to pgsql. Sometimes a user accidentally
deletes/corrupts a file, and discovers it three days later. After they
come panicking to me, I can give them their file as it was three days
ago, because of the backup tape rotation. Now suppose a user deletes
ten employees from the database, and three days later they understand
that this was a bad thing. Now what? I can restore the entire database
and make it as it was three days ago, but I can't restore the particular
deleted records in the current database, as the relations make the
database behave as a single unit.
A colleague suggested, instead of updating or deleting rows, to only
insert rows with a timestamp; for example, instead of updating the
employee row with id=55, you insert a new row with id=55 with the
updated data, and change the primary key to (id, dateinserted). You then
always select the row with max dateinserted. A garbage collector is also
needed to periodically delete obsolete rows older than, say, six months.
Improvements can be made (such as using dateobsoleted instead of
dateinserted or moving the old rows to another table), but even in the
simplest cases I think it will be extremely hard to implement such a
system, again because of the relations.
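A minimal sketch of the colleague's scheme, assuming a hypothetical employee table with just a name column:

```sql
-- Versioned table: rows are only ever inserted, never updated or deleted.
CREATE TABLE employee (
    id           integer   NOT NULL,
    dateinserted timestamp NOT NULL DEFAULT now(),
    name         text,
    PRIMARY KEY (id, dateinserted)
);

-- An "update" of employee 55 is just a new row:
INSERT INTO employee (id, name) VALUES (55, 'New Name');

-- The current state is the row with max(dateinserted) for each id:
CREATE VIEW employee_current AS
    SELECT e.*
      FROM employee e
     WHERE e.dateinserted = (SELECT max(dateinserted)
                               FROM employee
                              WHERE id = e.id);

-- The garbage collector would periodically run something like:
DELETE FROM employee
 WHERE dateinserted < now() - interval '6 months'
   AND dateinserted < (SELECT max(dateinserted)
                         FROM employee e2
                        WHERE e2.id = employee.id);
```

As noted above, the relations are exactly where this hurts: other tables can no longer reference employee(id) directly, since id alone is no longer unique.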
So, is it a matter of database design? Do I have to design the database
so that it keeps a history of what happened?
Thanks everyone for the answers.
On 19 Jun 2003 at 11:42, Antonios Christofides wrote:
[snip]
A cheaper solution in terms of effort is to add a last-update timestamp to
each blob and dump with custom SQL.
Incremental backups would be really nice. If you have really high turnover,
I would suggest you look at an asynchronous replication solution that works
by segmenting the WAL. That way you would back up only the things that have
changed.
HTH
Bye
Shridhar
--
Blore's Razor: Given a choice between two theories, take the one which is
funnier.
On Thu, 2003-06-19 at 03:42, Antonios Christofides wrote:
Hi,
[snip]
My second question is a general relational database backup question [snip]
One possibility that negates the need for "garbage collection" is
to create a history table that mirrors the primary table, and
also has an "activity_code char(1)", with values either "U" or "D".
Then a "before update" or "before delete" trigger would automatically
insert the pre-update-or-delete record into the history table, along
with U or D in activity_code.
Thus, you can look through the history table and find the accidentally
deleted records.
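A sketch of such a trigger in PL/pgSQL, assuming a hypothetical simplified employee(id, name) table:

```sql
-- History table mirrors employee, plus the activity code and a timestamp.
CREATE TABLE employee_history (
    id            integer,
    name          text,
    activity_code char(1),              -- 'U' = updated, 'D' = deleted
    changed_at    timestamp DEFAULT now()
);

CREATE FUNCTION employee_audit() RETURNS trigger AS '
BEGIN
    -- Save the pre-update-or-delete state of the row.
    INSERT INTO employee_history (id, name, activity_code)
         VALUES (OLD.id, OLD.name,
                 CASE WHEN TG_OP = ''DELETE'' THEN ''D'' ELSE ''U'' END);
    IF TG_OP = ''DELETE'' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER employee_audit_trg
    BEFORE UPDATE OR DELETE ON employee
    FOR EACH ROW EXECUTE PROCEDURE employee_audit();
```

Accidentally deleted rows can then be found with SELECT * FROM employee_history WHERE activity_code = 'D'.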
--
+-----------------------------------------------------------+
| Ron Johnson, Jr. Home: ron.l.johnson@cox.net |
| Jefferson, LA USA http://members.cox.net/ron.l.johnson |
| |
| "Oh, great altar of passive entertainment, bestow upon me |
| thy discordant images at such speed as to render linear |
| thought impossible" (Calvin, regarding TV) |
+-----------------------------------------------------------+
On 19 Jun 2003 at 4:15, Ron Johnson wrote:
[snip]
That makes me think. You could write before insert/before update/before
delete triggers and use dblink in conjunction with them to mimic some kind
of mirroring.
Of course that is bad for keeping transactions in sync, but it remains an
option. If you put the remote database connection in async mode in the
dblink module, you can reduce the impact of the remote operation as well.
HTH
Bye
Shridhar
--
Virtue is a relative term. -- Spock, "Friday's Child", stardate 3499.1
On Thu, Jun 19, 2003 at 11:42:28 +0300,
Antonios Christofides <A.Christofides@itia.ntua.gr> wrote:
My second question is a general relational database backup question [snip]
This isn't a good situation to be in. If you log all of the transactions,
one option may be to restore the database as of the last time it was
valid, remove the bogus transactions from the transaction log, and then
replay the transactions from the time of the backup to the present.
This will mean downtime for the database, and it is possible that removing
the bogus transactions will affect things in ways you don't want as you
reapply the other transactions.
Another option is to figure out what the bogus transactions did (which
keeping a history will help with) and try to undo them as best as possible
with new transactions.
Either of the above will need manual oversight.
Antonios Christofides <A.Christofides@itia.ntua.gr> writes:
Is this filenames-instead-of-BLOBs for easier backup common practice?
Any other ideas or comments?
This is a major point of contention. Some people think keeping all data in the
database is a better approach, others think data that isn't inherently
relational and doesn't need the protection of transactions doesn't really
belong in the database.
I happen to be firmly in the camp against putting such files in the database.
But it depends a lot on what your needs are. In every case I've faced this it
was simply useful to have the files accessible without piping them through the
database protocol every time. I could point the web server at the directory
and serve them up directly, or rsync them to the machines that could do that.
They could even be served up from a separate lightweight web server without
any database access altogether, which would have been impossible if the only
way to access them was via the database.
If you need to be able to update portions of your blobs, or if you need
transactional safety then you may need the database.
My second question is a general relational database backup question [snip]
What's worse is you may have several related changes to multiple tables. And
then you won't know if any other changes to other tables were later done that
depended on that data. There will be no way to do this perfectly in general.
If your data is particularly amenable to this form of normalization then it
can be useful though. For example, instead of storing counters that are
incremented, even when the total is the only interesting statistic, I normally
insert new records for every event. If ever it turns out something was wrong
and the events should be disregarded for a particular time period, or with
particular other properties, I have the complete history and can do that.
Inserts are also more efficient for the database to do than updates.
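As a sketch of this insert-instead-of-increment idea, with a hypothetical page-hits counter:

```sql
-- Instead of: UPDATE counters SET hits = hits + 1 WHERE page_id = 7;
-- record one row per event:
CREATE TABLE page_hits (
    page_id  integer   NOT NULL,
    hit_time timestamp NOT NULL DEFAULT now()
);

INSERT INTO page_hits (page_id) VALUES (7);

-- The total is an aggregate, and bad periods can be excluded after the fact:
SELECT count(*)
  FROM page_hits
 WHERE page_id = 7
   AND hit_time NOT BETWEEN '2003-06-16' AND '2003-06-19';
```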
But for something like an employee table you're probably going to be stuck
with restoring the table to a new name, and having a human look over the old
data and the current data and updating the current data appropriately.
You may want to consider having a "deleted" flag column and not ever deleting
records. So at least undelete can be an easy operation that doesn't even need
restoring backups.
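A minimal sketch of the deleted-flag approach, again with a hypothetical employee table:

```sql
-- Add the flag (pre-existing rows get NULL, so queries must allow for that):
ALTER TABLE employee ADD COLUMN deleted boolean;
ALTER TABLE employee ALTER COLUMN deleted SET DEFAULT false;

-- "Deleting" and undeleting are both ordinary updates:
UPDATE employee SET deleted = true  WHERE id = 55;
UPDATE employee SET deleted = false WHERE id = 55;

-- A view can hide deleted rows from the rest of the application:
CREATE VIEW employee_live AS
    SELECT * FROM employee WHERE deleted IS NOT TRUE;
```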
--
greg
On your second question:
Keeping old data helps with data analysis, i.e., data mining. I would handle the fired date as transactions: to see whether an employee is still an employee, look for the latest transaction, whether hired, rehired, contracted as a temp/consultant, fired, laid off, etc.
Antonios Christofides wrote:
[snip]
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
Regarding backup history:
I have an application designed for novices. Apparently it's easy to hit the
"Delete" button, and then say yes to the "Are you sure you want to delete
this?" question even when they don't want to. Therefore I simply mark a
record as deleted. For example,
UPDATE table SET deleted='t' WHERE something=true;
Then my application logic pretends the record doesn't exist until, two days
later, the user decides they want it back.
It works very well for me.
--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org
On Thu, 19 Jun 2003, Matthew Nuzum wrote:
Therefore I simply mark a record as deleted. [snip]
But are you also taking care of the referential integrity issues, i.e.
disallowing tuples with deleted = true from being referenced, and ensuring
nothing references them at the time they are marked as deleted?
It is a useful idea, but as I know from a current project, it requires
reimplementing foreign key functionality. In this case the middleware uses
only functions, one per statement, and nothing else, so I have been able to
do much of this in those functions, but it's still a pain. I even wrote a
utility to take some of the legwork out of generating and maintaining quite
a few functions, but if I'd had time [and thought about these basically
being foreign key constraints] I'd have looked at the existing foreign key
code and seen if I could copy and amend it, or just amend it in place.
--
Nigel Andrews
That's a good point, referential integrity and 'deleted' items. I'll have to take a look at that as I make my next design. I'm surprised I didn't think of it, but I probably would have run into it soon, as I am getting ready to put data into the design I'm on now.
One thing I know makes it all easier is to use surrogate integer keys, i.e. sequences, as the primary key on all tables.
Nigel J. Andrews wrote:
[snip]
I think somebody already suggested the following solution, which would
solve the referential integrity problems:
- create a "history" table with the exact layout of the working table,
but without its FK constraints, and optionally with an insertion date;
- create a trigger on the working table so that it inserts the current
state into the history table on update/delete/(maybe insert).
Now when you delete a row from the working table, you have all the
benefits of FKs, and the history table provides all the history info you
need.
Recovery would be possible only by manually copying back from the history
table, with possible FK hassles, but the data would be there.
Cheers,
Csaba.
On Fri, 2003-06-20 at 18:28, Dennis Gearon wrote:
[snip]