check for missing tablespaces?
I am in the middle of helping a customer recover from a situation where
a tablespace was missing when a machine was rebooted and postgres
restarted, and I'm wondering if we should not have some sort of check
for this on startup. Maybe we could check for the existence of the
PG_VERSION file or something like that? Of course, that wouldn't help if
the tablespace were subsequently lost, but it doesn't seem like a
terribly expensive or unreasonable sanity check for startup.
cheers
andrew
* Andrew Dunstan (andrew@dunslane.net) wrote:
I am in the middle of helping a customer recover from a situation where
a tablespace was missing when a machine was rebooted and postgres
restarted, and I'm wondering if we should not have some sort of check
for this on startup. Maybe we could check for the existence of the
PG_VERSION file or something like that? Of course, that wouldn't help if
the tablespace were subsequently lost, but it doesn't seem like a
terribly expensive or unreasonable sanity check for startup.
I agree entirely with doing this. Now that we've got tablespaces and
our users are using them to split things on to seperate partitions which
could be accessed in different ways (eg: a tablespace on local storage
and one on a SAN) it makes alot of sense to make sure everything is in
order before allowing the system to completely come up. I'd much rather
PG not start when the SAN mappings aren't in place.
Thanks,
Stephen
Andrew Dunstan <andrew@dunslane.net> writes:
I am in the middle of helping a customer recover from a situation where
a tablespace was missing when a machine was rebooted and postgres
restarted, and I'm wondering if we should not have some sort of check
for this on startup. Maybe we could check for the existence of the
PG_VERSION file or something like that?
... and do what?
What exactly went wrong so badly with the missing tablespace? Doesn't
seem like it should be any worse than if individual table files went
missing.
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I am in the middle of helping a customer recover from a situation where
a tablespace was missing when a machine was rebooted and postgres
restarted, and I'm wondering if we should not have some sort of check
for this on startup. Maybe we could check for the existence of the
PG_VERSION file or something like that?... and do what?
What exactly went wrong so badly with the missing tablespace? Doesn't
seem like it should be any worse than if individual table files went
missing.
Well, in the present instance probably nothing drastic went wrong. The
tablespace only contained indexes - if an index is missing then updates
to the base table will fail, right?
In general, I think I'd probably prefer normal database startup to fail
if a tablespace is missing. That way I will know about it right then and
can remedy it. This is something that is much more likely to happen than
an arbitrary missing file, ISTM, and at the same time the check would be
much easier than looking for an arbitrary file. At the very least we
could issue a warning.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Tom Lane wrote:
... and do what?
In general, I think I'd probably prefer normal database startup to fail
if a tablespace is missing. That way I will know about it right then and
can remedy it. This is something that is much more likely to happen than
an arbitrary missing file, ISTM, and at the same time the check would be
much easier than looking for an arbitrary file. At the very least we
could issue a warning.
So what you're imagining is
* iterate through each symlink in $PGDATA/pg_tblspc
* check that PG_VERSION exists (and has the right contents??) in
each pointed-to directory
* fail if not
I guess this is reasonable, since we make a similar check for the core
data directory itself. Watch out for the initdb sequence though.
regards, tom lane
Tom Lane wrote:
So what you're imagining is
* iterate through each symlink in $PGDATA/pg_tblspc
* check that PG_VERSION exists (and has the right contents??) in
each pointed-to directory
* fail if notI guess this is reasonable, since we make a similar check for the core
data directory itself. Watch out for the initdb sequence though.
I'll put this on my TODO list.
I'm not sure about the initdb reference - there won't be any tablespaces
to check for during initdb, will there?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
I'm not sure about the initdb reference - there won't be any tablespaces
to check for during initdb, will there?
No, but I think pg_tblspc/ itself might not be there either. Just a
case to test your patch on ...
regards, tom lane
Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
I'm not sure about the initdb reference - there won't be any tablespaces
to check for during initdb, will there?No, but I think pg_tblspc/ itself might not be there either. Just a
case to test your patch on ...
Sure it is, it's set up by initdb along with the other subdirectories
before it creates the PG_VERSION files.
Anyway, I don't know that there's any urgency about this.
cheers
andrew
Andrew Dunstan wrote:
I am in the middle of helping a customer recover from a situation
where a tablespace was missing when a machine was rebooted and
postgres restarted,
Have you uncovered why the tablespace when missing?
and I'm wondering if we should not have some sort of check for this on
startup
+1
Andrew Chernow
Andrew Dunstan wrote:
I am in the middle of helping a customer recover from a situation
where a tablespace was missing when a machine was rebooted and
postgres restarted,
Have you uncovered why the tablespace went missing?
and I'm wondering if we should not have some sort of check for this on
startup
+1
Andrew Chernow
Andrew Chernow wrote:
Andrew Dunstan wrote:
I am in the middle of helping a customer recover from a situation
where a tablespace was missing when a machine was rebooted and
postgres restarted,Have you uncovered why the tablespace went missing?
No. It's on a SAN, and I understand our hardware guys are talking to the
SAN vendor to get to the bottom of it. The network in the data centre
was reconfigured last night, but that should not in fact have affected
the SAN AFAIK. Other SAN devices mounted quite happily, including the
one containing the main data directory.
Anyway, from this POV all we really need to know is that the device
hosting this tablespace failed to mount when the machine was rebooted,
and then postgres restarted.
cheers
andrew
Anyway, from this POV all we really need to know is that the device
hosting this tablespace failed to mount when the machine was rebooted,
and then postgres restarted.
Good to know postgresql had nothing to do with the missing data. I
wasn't sure if it was user error, config problem or hardware.
From my experience, it doesn't really matter if you have a
sophisticated SAN or put together an ad-hoc nas/das solution, storage
likes to bite you. The only thing that helps me sleep at night is sound
replication and backups (in some cases triple redundancy with aging data
sets).
Andrew