I don't want to back up index files

Started by Glen Parker · about 17 years ago · 29 messages · general
#1Glen Parker
glenebob@nwlink.com

I am wondering about the feasibility of having PG continue to work even if
non-essential indexes are gone or corrupt. I brought this basic concept
up at some point in the past, but now I have a different motivation, so
I want to strike up discussion about it again. This time around, I
simply don't want to back up indexes if I don't have to. Because
indexes contain essentially redundant data, losing one does not equate
to losing real data. Therefore, backing them up represents a lot of
overhead for very little benefit.

Here's the basic idea:

1) New field to pg_index (indvalid boolean).
2) Query planner skips indexes where indvalid = false.
3) Executor does not update indexes where indvalid = false.
4) Executor refuses insert or update to unique columns where indvalid =
false, throwing an error.
5) WAL roll forward marks indvalid = false if index file(s) are missing,
rather than panicking.
6) REINDEX recognizes syntax to only build indexes with indvalid =
false, marks indvalid = true.
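For illustration, here is how the proposal might look from SQL. The indvalid column and the INVALID clause are hypothetical, taken from the points above; the existing pg_index.indisvalid flag used by CREATE INDEX CONCURRENTLY is related but not the same thing:

```sql
-- Hypothetical: list indexes that a WAL roll-forward marked invalid,
-- using the proposed pg_index.indvalid column (point 5).
SELECT c.relname
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indvalid;

-- Hypothetical syntax from point 6: rebuild only the invalid indexes.
REINDEX DATABASE foo INVALID;
```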

Close to 25% of the on-disk bulk of my database is index files. Not
having to archive the index files would save a significant amount of
the system resources used during the backup. In the unlikely
event that a restore/roll forward becomes necessary, I could simply
issue something like "REINDEX DATABASE foo INVALID;" to restore all the
missing indexes and return the database to full function. Prior to a
reindex, the database would perform poorly and refuse to do certain
inserts and updates, but the data would be available. Backup files
would be smaller, and the restore/roll forward would be faster.

No downsides jump out at me, and it seems to me that for a regular PG
code hacker this could actually be fairly simple to implement.

Any chance of something like this being done in the future?

-Glen

#2Joshua D. Drake
jd@commandprompt.com
In reply to: Glen Parker (#1)
Re: I don't want to back up index files

On Tue, 2009-03-10 at 18:54 -0700, Glen Parker wrote:

I am wondering about the feasibility of having PG continue to work even if
non-essential indexes are gone or corrupt. I brought this basic concept
up at some point in the past, but now I have a different motivation, so
I want to strike up discussion about it again. This time around, I
simply don't want to back up indexes if I don't have to. Because
indexes contain essentially redundant data, losing one does not equate
to losing real data. Therefore, backing them up represents a lot of
overhead for very little benefit.

Hello,

I am sorry but this seems very silly. If you don't want to back up
indexes use pg_dump.

Any chance of something like this being done in the future?

I am going to go out on a limb here and say, "no".

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997

#3Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Joshua D. Drake (#2)
Re: I don't want to back up index files

On Wed, Mar 11, 2009 at 3:03 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

Any chance of something like this being done in the future?

I am going to go out on a limb here and say, "no".

That would probably be possible, by placing all indices in a separate
directory in data, but...
Well, that would create a whole other dimension of mess and
requirements on restore.

So like JD said, if you don't want to dump indices - just use pg_dump...

--
GJ

#4Glen Parker
glenebob@nwlink.com
In reply to: Grzegorz Jaśkiewicz (#3)
Re: I don't want to back up index files

Grzegorz Jaśkiewicz wrote:

On Wed, Mar 11, 2009 at 3:03 PM, Joshua D. Drake <jd@commandprompt.com> wrote:

Any chance of something like this being done in the future?

I am going to go out on a limb here and say, "no".

That would probably be possible, by placing all indices in a separate
directory in data, but...

I could simply use the system catalog to determine which files to back
up and which to ignore. I suppose I would prefer to back up all unique
indexes, and perhaps all system indexes. It's the proliferation of
large indexes that serve only to enhance select performance that I would
want to ignore.
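A backup script could identify those files from the catalogs. A sketch of the idea (it assumes user objects have OIDs >= 16384 and that relfilenode maps to the file name inside the database directory, including any .1, .2 segment suffixes):

```sql
-- Sketch: non-unique user indexes whose files a base backup could skip.
SELECT c.relname, c.relfilenode
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisunique        -- keep unique indexes in the backup
  AND c.oid >= 16384;          -- skip system catalog indexes
```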

So like JD said, if you don't want to dump indices - just use pg_dump...

If pg_dump were an acceptable backup tool, we wouldn't need PITR, would
we? We used pg_dump for years. There's a very good reason we no longer
do. That suggestion is silly.

-Glen

#5Joshua D. Drake
jd@commandprompt.com
In reply to: Glen Parker (#4)
Re: I don't want to back up index files

On Wed, 2009-03-11 at 14:25 -0700, Glen Parker wrote:

Grzegorz Jaśkiewicz wrote:

So like JD said, if you don't want to dump indices - just use pg_dump...

If pg_dump were an acceptable backup tool, we wouldn't need PITR, would
we? We used pg_dump for years. There's a very good reason we no longer
do. That suggestion is silly.

*shrug* you can consider it silly. It doesn't change the outcome.

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997

#6Scott Marlowe
scott.marlowe@gmail.com
In reply to: Glen Parker (#4)
Re: I don't want to back up index files

On Wed, Mar 11, 2009 at 3:25 PM, Glen Parker <glenebob@nwlink.com> wrote:

Grzegorz Jaśkiewicz wrote:

So like JD said, if you don't want to dump indices - just use pg_dump...

If pg_dump were an acceptable backup tool, we wouldn't need PITR, would we?
 We used pg_dump for years.  There's a very good reason we no longer do.
 That suggestion is silly.

pg_dump is a perfectly acceptable backup tool, as is PITR. They have
different ways of operating based on what you need. Trying to make
PITR act more like pg_dump seems kind of silly to me.

#7Glen Parker
glenebob@nwlink.com
In reply to: Scott Marlowe (#6)
Re: I don't want to back up index files

Scott Marlowe wrote:

pg_dump is a perfectly acceptable backup tool, as is PITR. They have
different ways of operating based on what you need. Trying to make
PITR act more like pg_dump seems kind of silly to me.

pg_dump is not acceptable to us because of the potential to lose many
hours of valuable data. Why would pg_dump even be relevant to this
discussion? PITR offers a benefit that pg_dump does not, a benefit that
we, and countless other organizations, obviously find useful.
Suggesting that a person who's been managing PG in a commercial setting
since version 6.4 should just use pg_dump as an alternative to PITR is,
well, rather insulting.

That's two people now who have called the idea "silly" without even a
hint of a supporting argument. Why would it be "silly" to improve the
performance of a highly valuable tool set without compromising its
utility? Am I missing something here? That's certainly possible, but
the idea didn't just hatch last night; I've put enough thought into this
to have reason to believe it's more than just "silly".

-Glen

#8Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Glen Parker (#7)
Re: I don't want to back up index files

Glen Parker wrote:

That's two people now who have called the idea "silly" without even a
hint of a supporting argument. Why would it be "silly" to improve the
performance of a highly valuable tool set without compromising its
utility? Am I missing something here? That's certainly possible, but
the idea didn't just hatch last night; I've put enough thought into this
to have reason to believe it's more than just "silly".

FWIW I don't think this idea is silly at all. It's so not-silly, in
fact, that we already have some access methods that do this if an index
cannot be recovered (I think at least GiST does it).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#8)
Re: I don't want to back up index files

Alvaro Herrera <alvherre@commandprompt.com> writes:

FWIW I don't think this idea is silly at all. It's so not-silly, in
fact, that we already have some access methods that do this if an index
cannot be recovered (I think at least GiST does it).

Well, there's a difference between "rebuild the index when it can't be
recovered" and "lose the index anytime the system burps". AFAICS what
Glen is proposing is to not WAL-log index changes, and with that any
crash no matter how minor would have to invalidate indexes.

regards, tom lane

#10Joshua D. Drake
jd@commandprompt.com
In reply to: Glen Parker (#7)
Re: I don't want to back up index files

On Wed, 2009-03-11 at 16:57 -0700, Glen Parker wrote:

Scott Marlowe wrote:

That's two people now who have called the idea "silly" without even a
hint of a supporting argument. Why would it be "silly" to improve the
performance of a highly valuable tool set without compromising its
utility? Am I missing something here? That's certainly possible, but
the idea didn't just hatch last night; I've put enough thought into this
to have reason to believe it's more than just "silly".

O.k. a couple of things:

1. You could put all your indexes into a table space, this would allow
you to "try" different things with the indexes.

2. Even though my peer Alvaro doesn't think the idea is silly, I still
do and here is why. If you can invalidate the indexes you will have to
reindex (or recreate) to make them valid (as you mentioned). That is an
exclusive lock.

If your database has any level of concurrency the cost to
recreate/reindex those indexes right when you are attempting to get your
standby into production is going to be very high. Basically you are
trading 25% hard disk space for a longer, possibly excessively longer
outage. Hard disk space is so darn cheap that it doesn't seem to make
any sense.

Creating indexes concurrently is also out, because while you are
building those indexes your performance will tank (everything falls
back to sequential scans), and there is a possibility that the
concurrent creation will fail.
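Point 1 above, segregating indexes into their own tablespace, can be sketched as follows; the tablespace name, path, and index definition are illustrative, not taken from the thread:

```sql
-- Illustrative: put rebuildable indexes in a dedicated tablespace so
-- they can be handled separately from the tables.
CREATE TABLESPACE idx_space LOCATION '/disks/fast/pg_indexes';

CREATE INDEX orders_customer_idx ON orders (customer_id)
    TABLESPACE idx_space;
```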

Sincerely,

Joshua D. Drake


--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997

#11Glen Parker
glenebob@nwlink.com
In reply to: Tom Lane (#9)
Re: I don't want to back up index files

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

FWIW I don't think this idea is silly at all. It's so not-silly, in
fact, that we already have some access methods that do this if an index
cannot be recovered (I think at least GiST does it).

Well, there's a difference between "rebuild the index when it can't be
recovered" and "lose the index anytime the system burps". AFAICS what
Glen is proposing is to not WAL-log index changes, and with that any
crash no matter how minor would have to invalidate indexes.

Nooo...! This has nothing to do with WAL logging index changes. What I
propose would have no effect on an end user that continues to back up
indexes. It would give people the *option* to not back up indexes.

This is about disaster recovery and the backups required to recover
sanely, not about hiccups that cause only the last handful of
transactions to be redone.

Causing invalidation of indexes after a crash might be acceptable in
some settings, but not in mine, and that isn't what I'm after.

-Glen

#12Glen Parker
glenebob@nwlink.com
In reply to: Joshua D. Drake (#10)
Re: I don't want to back up index files

Joshua D. Drake wrote:

1. You could put all your indexes into a table space, this would allow
you to "try" different things with the indexes.

Most of them are, but I still have to back them up in order to have a
valid backup, because the PITR code would choke if any are missing.

2. Even though my peer Alvaro doesn't think the idea is silly, I still
do and here is why. If you can invalidate the indexes you will have to
reindex (or recreate) to make them valid (as you mentioned). That is an
exclusive lock.

If your database has any level of concurrency the cost to
recreate/reindex those indexes right when you are attempting to get your
standby into production is going to be very high. Basically you are
trading 25% hard disk space for a longer, possibly excessively longer
outage. Hard disk space is so darn cheap that it doesn't seem to make
any sense.

We have yet to recover from a PG disaster. We back up every night, and
never use the backups for anything. To me, it seems perfectly
reasonable to get a quicker backup every night, with the remote
possibility of ever having to pay the price for it. And IMO, the price
on recovery would be minor, if not negative. The time to rebuild the
indexes would be offset by the shorter time needed to restore the
database cluster and allow the PITR stuff to work its magic. And of
course keep in mind, that while indexes are building, all the tables not
being indexed are accessible. Nothing is accessible during a PITR
operation. I would also be able to rebuild only the most important
indexes and leave the system to operate with a few missing until things
quiet down, and then finish reindexing.

-Glen

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Glen Parker (#11)
Re: I don't want to back up index files

Glen Parker <glenebob@nwlink.com> writes:

Tom Lane wrote:

... AFAICS what
Glen is proposing is to not WAL-log index changes, and with that any
crash no matter how minor would have to invalidate indexes.

Nooo...! This has nothing to do with WAL logging index changes.

How so? In any PITR-based situation it seems to me you need to worry
about the WAL bulk a lot more than the bulk of the base backup.

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Glen Parker (#12)
Re: I don't want to back up index files

Glen Parker <glenebob@nwlink.com> writes:

We have yet to recover from a PG disaster. We back up every night, and
never use the backups for anything. To me, it seems perfectly
reasonable to get a quicker backup every night, with the remote
possibility of ever having to pay the price for it.

Why don't you just switch to a less frequent full-backup schedule?

regards, tom lane

#15Scott Marlowe
scott.marlowe@gmail.com
In reply to: Glen Parker (#7)
Re: I don't want to back up index files

On Wed, Mar 11, 2009 at 5:57 PM, Glen Parker <glenebob@nwlink.com> wrote:

Scott Marlowe wrote:

pg_dump is a perfectly acceptable backup tool, as is PITR.  They have
different ways of operating based on what you need.  Trying to make
PITR act more like pg_dump seems kind of silly to me.

pg_dump is not acceptable to us because of the potential to lose many hours
of valuable data.

That's why we use pg_dump and slony. We lose no data without a giant
hosting center fire.

 Why would pg_dump even be relevant to this discussion?

Because it's one method of backing up a database?

 PITR offers a benefit that pg_dump does not, a benefit that we, and
countless other organizations, obviously find useful.

And this benefit is? Continuous backup, I assume. Something other
forms of replication can do, at which point pg_dump again becomes a
useful backup tool.

Suggesting that a
person who's been managing PG in a commercial setting since version 6.4
should just use pg_dump as an alternative to PITR is, well, rather
insulting.

Darn, I've only been around since 6.5.2, and have about four years on
an airline reservation system. I guess my opinions just don't count.
How could I possibly understand your advanced thought processes?

That's two people now who have called the idea "silly" without even a hint
of a supporting argument.  Why would it be "silly" to improve the
performance of a highly valuable tool set without compromising its utility?

Because it's the size of the WAL files that kills most people, and not
putting the index updates into WAL files would be a hack I wouldn't
trust, and having them on the other side but not adding them is just
wasting space? Because maybe you didn't explain everything as clearly
as you could, and I made assumptions based on your incomplete
description?

#16Joshua D. Drake
jd@commandprompt.com
In reply to: Scott Marlowe (#15)
Re: I don't want to back up index files

On Wed, 2009-03-11 at 20:59 -0600, Scott Marlowe wrote:

On Wed, Mar 11, 2009 at 5:57 PM, Glen Parker <glenebob@nwlink.com> wrote:

Scott Marlowe wrote:

Suggesting that a
person who's been managing PG in a commercial setting since version 6.4
should just use pg_dump as an alternative to PITR is, well, rather
insulting.

Darn, I've only been around since 6.5.2, and have about four years on
an airline reservation system. I guess my opinions just don't count.
How could I possibly understand your advanced thought processes.

Alright guys, let's not get into a my-elephant-trunk-is-bigger-than-yours
fight.

Joshua D. Drake

--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997

#17Alan Hodgson
ahodgson@simkin.ca
In reply to: Glen Parker (#12)
Re: I don't want to back up index files

On Wednesday 11 March 2009, Glen Parker <glenebob@nwlink.com> wrote:

We have yet to recover from a PG disaster. We back up every night, and
never use the backups for anything. To me, it seems perfectly
reasonable to get a quicker backup every night, with the remote
possibility of ever having to pay the price for it.

This isn't totally relevant, but as a way to speed up base backups, I keep a
copy of the whole database rsync'd on the backup machine. The base backup
consists of refreshing the rsync, and then tarring it offline (of course
with the PITR backup commands in the sequence). My database is currently
750GB on disk and gets many tens of millions of updates a day, and the
rsync still runs in less than an hour per night.
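That scheme, roughly sketched below. The host names, paths, and the pg_start_backup label are assumptions about his setup, and rsyncing a live cluster is only safe here because WAL replay from the pg_start_backup point repairs any inconsistently copied blocks:

```sql
-- On the primary, bracket the file-level copy with the backup calls
-- (8.x-era API; the label 'nightly' is an example):
SELECT pg_start_backup('nightly');
-- ... externally: rsync -a --delete $PGDATA/ backuphost:/backups/pgdata-mirror/
SELECT pg_stop_backup();
-- The backup host can then tar the refreshed mirror offline, so the
-- expensive compression never touches the production machine.
```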

I have done PITR recoveries (unfortunately). The log replay time dwarfs the
time it takes to copy the index files back over (it is, honestly, slower
than the original transactions were). Additionally, copying them is faster
than rebuilding them would be. Also, I can't imagine bringing a database
online without the majority of the indexes in place. The first dozen
non-indexed queries against large tables would kill the machine; not only
would you not be servicing users, but the rest of your restore would be
slowed immensely.

--
Even a sixth-grader can figure out that you can’t borrow money to pay off
your debt

#18Glen Parker
glenebob@nwlink.com
In reply to: Tom Lane (#13)
Re: I don't want to back up index files

Tom Lane wrote:

Glen Parker <glenebob@nwlink.com> writes:

Tom Lane wrote:

... AFAICS what
Glen is proposing is to not WAL-log index changes, and with that any
crash no matter how minor would have to invalidate indexes.

Nooo...! This has nothing to do with WAL logging index changes.

How so? In any PITR-based situation it seems to me you need to worry
about the WAL bulk a lot more than the bulk of the base backup.

It isn't the bulk so much as the amount of time, and the impact on the
running system during that time, that it takes to execute the base backup.

I haven't noticed any real impact related to compressing and exporting
WAL files.

Anyway, more to the point, I'm not knowingly proposing anything that
should cause reduced system reliability in the event of a crash.

-Glen

#19Glen Parker
glenebob@nwlink.com
In reply to: Tom Lane (#14)
Re: I don't want to back up index files

Tom Lane wrote:

Glen Parker <glenebob@nwlink.com> writes:

We have yet to recover from a PG disaster. We back up every night, and
never use the backups for anything. To me, it seems perfectly
reasonable to get a quicker backup every night, with the remote
possibility of ever having to pay the price for it.

Why don't you just switch to a less frequent full-backup schedule?

Paranoia. Others in the organization have pushed to keep up with the
nightly backups, so that decision is mostly out of my hands.

-Glen

#20Glen Parker
glenebob@nwlink.com
In reply to: Scott Marlowe (#15)
Re: I don't want to back up index files

Scott Marlowe wrote:

That's two people now who have called the idea "silly" without even a hint
of a supporting argument. Why would it be "silly" to improve the
performance of a highly valuable tool set without compromising its utility?

Because it's the size of the WAL files that kills most people, and not
putting the index updates into WAL files would be a hack I wouldn't
trust, and having them on the otherside but not adding them is just
wasting space? Cause maybe, you didn't explain everything as clearly
as you could, and I made assumptions based on your incomplete
description?

I've re-read my original email and I just can't see how anybody got the
idea I was suggesting to not WAL record index changes. That would be
insanity IMO.

So, to be clear, I'm not proposing any change to the way data is written
to the WAL files.

-Glen

#21Joshua D. Drake
jd@commandprompt.com
In reply to: Glen Parker (#19)
#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Glen Parker (#20)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Glen Parker (#20)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alan Hodgson (#17)
#25Glen Parker
glenebob@nwlink.com
In reply to: Tom Lane (#23)
#26Simon Riggs
simon@2ndQuadrant.com
In reply to: Glen Parker (#18)
#27Jeremy Harris
jgh@wizmail.org
In reply to: Simon Riggs (#26)
#28Glen Parker
glenebob@nwlink.com
In reply to: Jeremy Harris (#27)
#29Lew
noone@lwsc.ehost-services.com
In reply to: Glen Parker (#7)