Interesting post-mortem on a near disaster with git

Started by Tom Lane almost 13 years ago, 11 messages
#1Tom Lane
tgl@sss.pgh.pa.us

Over the weekend, KDE came within a gnat's eyelash of losing *all*
their authoritative git repos, despite having seemingly-extensive
redundancy. Read about it here:
http://jefferai.org/2013/03/24/too-perfect-a-mirror/

We should think about protecting our own repo a bit better, especially
after the recent unpleasantness with a bogus forced update. The idea
of having clones that are deliberately a day or two behind seems
attractive ...

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#1)
Re: Interesting post-mortem on a near disaster with git

On Sun, Mar 24, 2013 at 11:52:17AM -0400, Tom Lane wrote:

Over the weekend, KDE came within a gnat's eyelash of losing *all*
their authoritative git repos, despite having seemingly-extensive
redundancy. Read about it here:
http://jefferai.org/2013/03/24/too-perfect-a-mirror/

We should think about protecting our own repo a bit better, especially
after the recent unpleasantness with a bogus forced update. The idea
of having clones that are deliberately a day or two behind seems
attractive ...

I think the lesson here is that a mirror is not a backup. RAID, ZFS,
and version control are not backups either.

Taking a tarball of the entire repository and storing it on a different
machine would solve just about any problem you can think of in this
area.
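
A minimal sketch of the tarball approach, in Python. The function name
snapshot_repo, the archive naming scheme, and the paths are assumptions
made for illustration; none of them come from this thread. Each run
writes a new timestamped archive instead of overwriting the previous
one, so a damaged repository cannot silently replace the only good copy.
The sketch assumes nothing is writing to the repository while the
archive is made, and it leaves out the copy to a different machine.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def snapshot_repo(repo_dir, dest_dir, now=None):
    """Write repo_dir into a new timestamped .tar.gz under dest_dir.

    `now` is assumed to be a UTC datetime; it defaults to the current time.
    Returns the path of the archive that was created.
    """
    repo = Path(repo_dir).resolve()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    archive = dest / f"{repo.name}-{stamp}.tar.gz"

    # Never overwrite an older archive: old snapshots are the whole point.
    if archive.exists():
        raise FileExistsError(archive)

    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the repository's own directory name.
        tar.add(repo, arcname=repo.name)
    return archive
```

Called from a daily job as snapshot_repo("/path/to/project.git",
"/path/to/backups") (both paths hypothetical), this leaves one archive
per run; pruning old archives is a separate decision.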

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#3Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Martijn van Oosterhout (#2)
Re: Interesting post-mortem on a near disaster with git

On 03/24/2013 05:08 PM, Martijn van Oosterhout wrote:

On Sun, Mar 24, 2013 at 11:52:17AM -0400, Tom Lane wrote:

Over the weekend, KDE came within a gnat's eyelash of losing *all*
their authoritative git repos, despite having seemingly-extensive
redundancy. Read about it here:
http://jefferai.org/2013/03/24/too-perfect-a-mirror/

We should think about protecting our own repo a bit better, especially
after the recent unpleasantness with a bogus forced update. The idea
of having clones that are deliberately a day or two behind seems
attractive ...

I think the lesson here is that a mirror is not a backup. RAID, ZFS,
and version control are all not backups.

Taking a tarball of the entire repository and storing it on a different
machine would solve just about any problem you can think of in this
area.

fwiw - the sysadmin team has file-level backups of all pginfra hosts
(two backups/day, one per day for a week and a full per week for 4 weeks
of history).
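
The schedule above can be read in more than one way, so the following
Python sketch is only one possible interpretation and not the sysadmin
team's actual tooling: keep the newest snapshot of each of the last 7
days, plus the newest snapshot of each ISO week within the last 4 weeks.
The name snapshots_to_keep and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(timestamps, now, daily_days=7, weekly_weeks=4):
    """Return the set of snapshot timestamps a retention run would keep."""
    newest_per_day = {}
    newest_per_week = {}
    # Iterate oldest to newest so later snapshots overwrite earlier ones.
    for ts in sorted(timestamps):
        age = now - ts
        if age < timedelta(0):
            continue  # ignore snapshots dated in the future
        if age <= timedelta(days=daily_days):
            newest_per_day[ts.date()] = ts
        if age <= timedelta(weeks=weekly_weeks):
            year, week = ts.isocalendar()[:2]
            newest_per_week[(year, week)] = ts
    return set(newest_per_day.values()) | set(newest_per_week.values())
```

Everything not in the returned set would be a candidate for deletion.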

Stefan

#4Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#1)
Re: Interesting post-mortem on a near disaster with git

On Mon, Mar 25, 2013 at 12:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Over the weekend, KDE came within a gnat's eyelash of losing *all*
their authoritative git repos, despite having seemingly-extensive
redundancy. Read about it here:
http://jefferai.org/2013/03/24/too-perfect-a-mirror/

It is really great that the KDE people are actually sharing this
experience. It is valuable for other projects as well as for individuals.
And thanks for sharing it here.

We should think about protecting our own repo a bit better, especially
after the recent unpleasantness with a bogus forced update. The idea
of having clones that are deliberately a day or two behind seems
attractive ...

Just an idea here: why not add a new subdomain on postgresql.org for
mirrors of the official Git repository, similar to the buildfarm?
People registered with this service could publish their own mirrors and
decide for themselves how far their clone lags behind the parent repo.
The scripts used by each mirror maintainer (for backup, or for syncing
the repo with a given delay) could be centralized in a way similar to
the buildfarm code, so that everybody in the community could use them
and publish if they want.

Also, the published mirrors should be maintained by people who are
well known inside the community, and who would not add extra commits
that would put the mirror out of sync with the parent repo.

Such an idea is perhaps too much if the point is to maintain 2-3 mirrors
of the parent repo, but it gives enough transparency to actually know
where the mirrors are and what sync delay each one maintains.
--
Michael

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#4)
Re: Interesting post-mortem on a near disaster with git

On 03/24/2013 06:06 PM, Michael Paquier wrote:

On Mon, Mar 25, 2013 at 12:52 AM, Tom Lane <tgl@sss.pgh.pa.us
<mailto:tgl@sss.pgh.pa.us>> wrote:

Over the weekend, KDE came within a gnat's eyelash of losing *all*
their authoritative git repos, despite having seemingly-extensive
redundancy. Read about it here:
http://jefferai.org/2013/03/24/too-perfect-a-mirror/

It is really great that KDE people are actually sharing this
experience. This is really profitable for other projects as well as
individuals.
And thanks for sharing it here.

We should think about protecting our own repo a bit better, especially
after the recent unpleasantness with a bogus forced update. The idea
of having clones that are deliberately a day or two behind seems
attractive ...

Just an idea here: why not adding a new subdomain in postgresql.org
<http://postgresql.org> for mirrors of the official GIT repository
similar to the buildfarm?
People registered in this service could publish themselves mirrors and
decide by themselves the delay their
clone keeps with the parent repo. The scripts used by each mirror
maintainer (for backup, sync repo with
a given delay) could be centralized in a way similar to buildfarm code
so as everybody in the community could
use it and publish it if they want.

Also, the mirrors published should be maintained by people that are
well-known inside the community,
and who would not add extra commits which would make the mirror
out-of-sync with the parent repo.

Such an idea is perhaps too much if the point is to maintain 2-3
mirrors of the parent repo, but gives
enough transparency to actually know where the mirrors are and what is
the sync delay maintained.

This strikes me as being overkill. The sysadmins seem to have it covered.

Back when we used CVS, I kept 7-day rolling snapshots of the CVS repo
for quite a few years, against just such a difficulty as this. But
we seem to be much better organized with infrastructure these days, so I
haven't done that for a long time.

cheers

andrew

#6Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Andrew Dunstan (#5)
Re: Interesting post-mortem on a near disaster with git

On 03/24/2013 11:22 PM, Andrew Dunstan wrote:

On 03/24/2013 06:06 PM, Michael Paquier wrote:

On Mon, Mar 25, 2013 at 12:52 AM, Tom Lane <tgl@sss.pgh.pa.us
<mailto:tgl@sss.pgh.pa.us>> wrote:

Over the weekend, KDE came within a gnat's eyelash of losing *all*
their authoritative git repos, despite having seemingly-extensive
redundancy. Read about it here:
http://jefferai.org/2013/03/24/too-perfect-a-mirror/

It is really great that KDE people are actually sharing this
experience. This is really profitable for other projects as well as
individuals.
And thanks for sharing it here.

We should think about protecting our own repo a bit better,
especially
after the recent unpleasantness with a bogus forced update. The idea
of having clones that are deliberately a day or two behind seems
attractive ...

Just an idea here: why not adding a new subdomain in postgresql.org
<http://postgresql.org> for mirrors of the official GIT repository
similar to the buildfarm?
People registered in this service could publish themselves mirrors and
decide by themselves the delay their
clone keeps with the parent repo. The scripts used by each mirror
maintainer (for backup, sync repo with
a given delay) could be centralized in a way similar to buildfarm code
so as everybody in the community could
use it and publish it if they want.

Also, the mirrors published should be maintained by people that are
well-known inside the community,
and who would not add extra commits which would make the mirror
out-of-sync with the parent repo.

Such an idea is perhaps too much if the point is to maintain 2-3
mirrors of the parent repo, but gives
enough transparency to actually know where the mirrors are and what is
the sync delay maintained.

This strikes me as being overkill. The sysadmins seem to have it covered.

Back when we used CVS for quite a few years I kept 7 day rolling
snapshots of the CVS repo, against just such a difficulty as this. But
we seem to be much better organized with infrastructure these days so I
haven't done that for a long time.

Well, there is always room for improvement (and for learning from
others), but I agree that this proposal seems like overkill. If people
think we should keep online "delayed" mirrors, we certainly have the
resources to do that on our own if we want, though...

Stefan

#7Magnus Hagander
magnus@hagander.net
In reply to: Stefan Kaltenbrunner (#6)
Re: Interesting post-mortem on a near disaster with git

On Mon, Mar 25, 2013 at 7:07 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

On 03/24/2013 11:22 PM, Andrew Dunstan wrote:

On 03/24/2013 06:06 PM, Michael Paquier wrote:

On Mon, Mar 25, 2013 at 12:52 AM, Tom Lane <tgl@sss.pgh.pa.us
<mailto:tgl@sss.pgh.pa.us>> wrote:

Over the weekend, KDE came within a gnat's eyelash of losing *all*
their authoritative git repos, despite having seemingly-extensive
redundancy. Read about it here:
http://jefferai.org/2013/03/24/too-perfect-a-mirror/

It is really great that KDE people are actually sharing this
experience. This is really profitable for other projects as well as
individuals.
And thanks for sharing it here.

We should think about protecting our own repo a bit better,
especially
after the recent unpleasantness with a bogus forced update. The idea
of having clones that are deliberately a day or two behind seems
attractive ...

Just an idea here: why not adding a new subdomain in postgresql.org
<http://postgresql.org> for mirrors of the official GIT repository
similar to the buildfarm?
People registered in this service could publish themselves mirrors and
decide by themselves the delay their
clone keeps with the parent repo. The scripts used by each mirror
maintainer (for backup, sync repo with
a given delay) could be centralized in a way similar to buildfarm code
so as everybody in the community could
use it and publish it if they want.

Also, the mirrors published should be maintained by people that are
well-known inside the community,
and who would not add extra commits which would make the mirror
out-of-sync with the parent repo.

Such an idea is perhaps too much if the point is to maintain 2-3
mirrors of the parent repo, but gives
enough transparency to actually know where the mirrors are and what is
the sync delay maintained.

This strikes me as being overkill. The sysadmins seem to have it covered.

Back when we used CVS for quite a few years I kept 7 day rolling
snapshots of the CVS repo, against just such a difficulty as this. But
we seem to be much better organized with infrastructure these days so I
haven't done that for a long time.

well there is always room for improvement(and for learning from others)
- but I agree that this proposal seems way overkill. If people think we
should keep online "delayed" mirrors we certainly have the resources to
do that on our own if we want though...

Yeah, definitely.

It's also interesting to note that one of the things they do is to
"stop using mirrored clones". The fact that we *don't* use mirrored
clones for our anon repository is exactly how we caught the issue
caused by the invalid push that Kevin did a short while ago...
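
The property a non-mirrored clone can rely on here is presumably the
fast-forward test: an update is accepted only if the old branch tip is
an ancestor of the new one, so a forced push that rewrites history
stands out instead of being copied. The Python sketch below models that
test on a toy commit graph (a dict mapping each commit to its parents).
The function names and the graph are made up for illustration, and the
sketch says nothing about how the anon repository is actually updated.
Git itself exposes the same test as
"git merge-base --is-ancestor <old> <new>".

```python
def is_ancestor(parents, ancestor, descendant):
    """True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return False


def is_fast_forward(parents, old_tip, new_tip):
    """An update old_tip -> new_tip loses no history iff old_tip is an ancestor."""
    return is_ancestor(parents, old_tip, new_tip)


# Toy history: a <- b <- c is the real branch; x is a rewritten branch off a.
history = {"a": [], "b": ["a"], "c": ["b"], "x": ["a"]}
```

With this history, moving the branch from b to c is a fast-forward,
while moving it from c to x would discard b and c, which is the case a
blind mirror would happily replicate.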

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#8Daniel Farina
daniel@heroku.com
In reply to: Stefan Kaltenbrunner (#6)
Re: Interesting post-mortem on a near disaster with git

On Mon, Mar 25, 2013 at 11:07 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

Back when we used CVS for quite a few years I kept 7 day rolling
snapshots of the CVS repo, against just such a difficulty as this. But
we seem to be much better organized with infrastructure these days so I
haven't done that for a long time.

well there is always room for improvement(and for learning from others)
- but I agree that this proposal seems way overkill. If people think we
should keep online "delayed" mirrors we certainly have the resources to
do that on our own if we want though...

What about rdiff-backup? I set it up for personal use years ago
(via the handy open source bash script backupninja), and it is a
pretty nice no-frills, point-in-time, self-expiring, file-based
automatic backup program that works well with file synchronization
like rsync (I rdiff-backup to one disk and rsync the entire
rdiff-backup output to another disk). I've enjoyed using it quite a
bit during my own personal-computer emergencies: the maintenance
required from me has been zero, and I have used it from time to time
to restore, proving it even works. Hardlinks can be used to tag
versions of a directory tree recursively and relatively compactly.

It won't be as compact as a git-aware solution (since git tends to
rewrite entire files, which will confuse file-based incremental
differential backup), but the amount of data we are talking about is
pretty small, and as a lowest-common-denominator tradeoff for use in
emergencies, I have to give it a lot of praise. The main advantage it
has here is that it implements point-in-time recovery operations that
are easy to use and actually seem to work. That said, I've mostly done
targeted recoveries rather than trying to recover entire trees.

--
fdr

#9Cédric Villemain
cedric@2ndquadrant.com
In reply to: Daniel Farina (#8)
Re: Interesting post-mortem on a near disaster with git

On Monday, 25 March 2013 19:35:12, Daniel Farina wrote:

On Mon, Mar 25, 2013 at 11:07 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

Back when we used CVS for quite a few years I kept 7 day rolling
snapshots of the CVS repo, against just such a difficulty as this. But
we seem to be much better organized with infrastructure these days so I
haven't done that for a long time.

well there is always room for improvement(and for learning from others)
- but I agree that this proposal seems way overkill. If people think we
should keep online "delayed" mirrors we certainly have the resources to
do that on our own if we want though...

What about rdiff-backup? I've set it up for personal use years ago
(via the handy open source bash script backupninja) years ago and it
has a pretty nice no-frills point-in-time, self-expiring, file-based
automatic backup program that works well with file synchronization
like rsync (I rdiff-backup to one disk and rsync the entire
rsync-backup output to another disk). I've enjoyed using it quite a
bit during my own personal-computer emergencies and thought the
maintenance required from me has been zero, and I have used it from
time to time to restore, proving it even works. Hardlinks can be used
to tag versions of a file-directory tree recursively relatively
compactly.

It won't be as compact as a git-aware solution (since git tends to
rewrite entire files, which will confuse file-based incremental
differential backup), but the amount of data we are talking about is
pretty small, and as far as a lowest-common-denominator tradeoff for
use in emergencies, I have to give it a lot of praise. The main
advantage it has here is it implements point-in-time recovery
operations that are easy to use and actually seem to work. That said,
I've mostly done targeted recoveries rather than trying to recover
entire trees.

I have the same set up, and same feedback.
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

#10Jim Nasby
jim@nasby.net
In reply to: Cédric Villemain (#9)
Re: Interesting post-mortem on a near disaster with git

On 3/26/13 6:42 AM, Cédric Villemain wrote:

On Monday, 25 March 2013 19:35:12, Daniel Farina wrote:

On Mon, Mar 25, 2013 at 11:07 AM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:

Back when we used CVS for quite a few years I kept 7 day rolling
snapshots of the CVS repo, against just such a difficulty as this. But
we seem to be much better organized with infrastructure these days so I
haven't done that for a long time.

well there is always room for improvement (and for learning from others)
- but I agree that this proposal seems way overkill. If people think we
should keep online "delayed" mirrors we certainly have the resources to
do that on our own if we want though...

What about rdiff-backup? I've set it up for personal use years ago
(via the handy open source bash script backupninja) years ago and it
has a pretty nice no-frills point-in-time, self-expiring, file-based
automatic backup program that works well with file synchronization
like rsync (I rdiff-backup to one disk and rsync the entire
rsync-backup output to another disk). I've enjoyed using it quite a
bit during my own personal-computer emergencies and thought the
maintenance required from me has been zero, and I have used it from
time to time to restore, proving it even works. Hardlinks can be used
to tag versions of a file-directory tree recursively relatively
compactly.

It won't be as compact as a git-aware solution (since git tends to
rewrite entire files, which will confuse file-based incremental
differential backup), but the amount of data we are talking about is
pretty small, and as far as a lowest-common-denominator tradeoff for
use in emergencies, I have to give it a lot of praise. The main
advantage it has here is it implements point-in-time recovery
operations that are easy to use and actually seem to work. That said,
I've mostly done targeted recoveries rather than trying to recover
entire trees.

I have the same set up, and same feedback.

I had the same setup, but got tired of how rdiff-backup behaved when a backup was interrupted (very lengthy cleanup process). Since then I've switched to an rsync setup that does essentially the same thing as rdiff-backup (uses hardlinks between multiple copies).

The only downside I'm aware of is that my rsync backups aren't guaranteed to be "consistent" (for however consistent a backup of an active FS would be...).
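
The hardlink approach described above can be sketched in Python as
follows. This is an illustration of the idea, not the actual setup: the
function name hardlink_snapshot and its arguments are hypothetical, and
rsync provides the same behaviour natively through its --link-dest
option. Files that are identical to the previous snapshot are
hardlinked, so they take no extra space; new or changed files are
copied. The sketch compares file contents byte by byte for simplicity,
assumes the snapshot directory lies outside the source tree, and
requires both snapshots to be on the same filesystem.

```python
import filecmp
import os
import shutil
from pathlib import Path


def hardlink_snapshot(source, snapshot, previous=None):
    """Copy `source` into `snapshot`, hardlinking files unchanged since `previous`."""
    source, snapshot = Path(source), Path(snapshot)
    previous = Path(previous) if previous is not None else None

    for root, _dirs, files in os.walk(source):
        rel = Path(root).relative_to(source)
        (snapshot / rel).mkdir(parents=True, exist_ok=True)
        for name in files:
            src = Path(root) / name
            dst = snapshot / rel / name
            old = previous / rel / name if previous is not None else None
            if old is not None and old.is_file() and filecmp.cmp(src, old, shallow=False):
                # Unchanged: share storage with the previous snapshot.
                os.link(old, dst)
            else:
                # New or changed: store a fresh copy, keeping timestamps.
                shutil.copy2(src, dst)
```

Because every snapshot is a complete directory tree, deleting an old
snapshot never affects newer ones: the data is freed only when the last
hardlink to it disappears.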
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#11Daniel Farina
daniel@heroku.com
In reply to: Jim Nasby (#10)
Re: Interesting post-mortem on a near disaster with git

On Wed, Apr 3, 2013 at 6:18 PM, Jim Nasby <jim@nasby.net> wrote:

What about rdiff-backup? I've set it up for personal use years ago
(via the handy open source bash script backupninja) years ago and it
has a pretty nice no-frills point-in-time, self-expiring, file-based
automatic backup program that works well with file synchronization
like rsync (I rdiff-backup to one disk and rsync the entire
rsync-backup output to another disk). I've enjoyed using it quite a
bit during my own personal-computer emergencies and thought the
maintenance required from me has been zero, and I have used it from
time to time to restore, proving it even works. Hardlinks can be used
to tag versions of a file-directory tree recursively relatively
compactly.

It won't be as compact as a git-aware solution (since git tends to
rewrite entire files, which will confuse file-based incremental
differential backup), but the amount of data we are talking about is
pretty small, and as far as a lowest-common-denominator tradeoff for
use in emergencies, I have to give it a lot of praise. The main
advantage it has here is it implements point-in-time recovery
operations that are easy to use and actually seem to work. That said,
I've mostly done targeted recoveries rather than trying to recover
entire trees.

I have the same set up, and same feedback.

I had the same setup, but got tired of how rdiff-backup behaved when a
backup was interrupted (very lengthy cleanup process). Since then I've
switched to an rsync setup that does essentially the same thing as
rdiff-backup (uses hardlinks between multiple copies).

The only downside I'm aware of is that my rsync backups aren't guaranteed to
be "consistent" (for however consistent a backup of an active FS would
be...).

I forgot to add one more thing to my first mail, although it's very
important to my feeble recommendation: the problem is that blind
synchronization is a great way to propagate destruction.

rdiff-backup (but perhaps others, too) has a file/directory structure
that is, as far as I know, additive, and the pruning can be done
independently at different replicas that can have different
retention. If done just right (I'm not sure about the case of
concurrent backups being taken), one can write a check that no files
are to be modified or deleted by the synchronization, as a safeguard.
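
A minimal sketch of such a safeguard, in Python. The function name
destructive_changes is hypothetical, and whether rdiff-backup's on-disk
layout is really additive is not verified here. The idea is to list,
before synchronizing, every file on the replica that the sync would
delete or overwrite; if the backup structure is additive, both lists
should be empty, and anything else is a reason to stop and look.

```python
import filecmp
import os
from pathlib import Path


def destructive_changes(source, replica):
    """Return (would_delete, would_modify) for a blind source -> replica sync.

    Both lists hold paths relative to `replica`, in sorted order.
    """
    source, replica = Path(source), Path(replica)
    would_delete, would_modify = [], []
    for root, _dirs, files in os.walk(replica):
        for name in files:
            rel = (Path(root) / name).relative_to(replica)
            src = source / rel
            if not src.is_file():
                # Present on the replica, gone from the source.
                would_delete.append(rel.as_posix())
            elif not filecmp.cmp(src, replica / rel, shallow=False):
                # Present on both sides, but the contents differ.
                would_modify.append(rel.as_posix())
    return sorted(would_delete), sorted(would_modify)
```

A sync wrapper would run this first and refuse to proceed unless both
lists are empty, so that an emptied or corrupted source cannot wipe the
replica.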

--
fdr
