pg_internal.init is hazardous to your health
Dirk Lutzebaeck and I just spent a tense couple of hours trying to
figure out why a large database Down Under wasn't coming up after being
reloaded from a base backup plus PITR recovery. The symptoms were that
the recovery went fine, but backend processes would fail at startup or
soon after with "could not open relation XX/XX/XX: No such file" type of
errors.
The answer that ultimately emerged was that they'd been running a
nightly maintenance script that did REINDEX SYSTEM (among other things
I suppose). The PITR base backup included pg_internal.init files that
were appropriate when it was taken, and the PITR recovery process did
nothing whatsoever to update 'em :-(. So incoming backends picked up
init files with obsolete relfilenode values.
We don't actually need to *update* the file, per se, we only need to
remove it if no longer valid --- the next incoming backend will rebuild
it. I could see fixing this by making WAL recovery run around and zap
all the .init files (only problem is to find 'em), or we could add a new
kind of WAL record saying "remove the .init file for database XYZ"
to be emitted whenever someone removes the active one. Thoughts?
Meanwhile, if you're trying to recover from a PITR backup and it's not
working, try removing any pg_internal.init files you can find.
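A minimal sketch of that workaround, not a supported tool. It assumes the server is stopped, that the data directory path is passed in by the caller, and that every file named pg_internal.init beneath it (including ones reached through symlinked tablespace directories) is safe to delete, since the next backend rebuilds it. The function names are hypothetical.

```python
import os
from pathlib import Path

INIT_FILE_NAME = "pg_internal.init"


def find_init_files(data_dir):
    """Return every pg_internal.init found under data_dir, sorted by path."""
    found = []
    # followlinks=True so that symlinked directories are searched too;
    # this assumes the directory tree contains no symlink cycles.
    for root, _dirs, files in os.walk(data_dir, followlinks=True):
        if INIT_FILE_NAME in files:
            found.append(Path(root) / INIT_FILE_NAME)
    return sorted(found)


def remove_init_files(data_dir, dry_run=True):
    """List the init files and, unless dry_run is set, delete them."""
    paths = find_init_files(data_dir)
    if not dry_run:
        for path in paths:
            path.unlink()
    return paths
```

Calling remove_init_files(path) with the default dry_run=True only reports what would be removed; pass dry_run=False to actually delete.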
regards, tom lane
On Tue, 17 Oct 2006, Tom Lane wrote:
Dirk Lutzebaeck and I just spent a tense couple of hours trying to
figure out why a large database Down Under wasn't coming up after being
reloaded from a base backup plus PITR recovery. The symptoms were that
the recovery went fine, but backend processes would fail at startup or
soon after with "could not open relation XX/XX/XX: No such file" type of
errors.
The answer that ultimately emerged was that they'd been running a
nightly maintenance script that did REINDEX SYSTEM (among other things
I suppose). The PITR base backup included pg_internal.init files that
were appropriate when it was taken, and the PITR recovery process did
nothing whatsoever to update 'em :-(. So incoming backends picked up
init files with obsolete relfilenode values.
Ouch.
We don't actually need to *update* the file, per se, we only need to
remove it if no longer valid --- the next incoming backend will rebuild
it. I could see fixing this by making WAL recovery run around and zap
all the .init files (only problem is to find 'em), or we could add a new
kind of WAL record saying "remove the .init file for database XYZ"
to be emitted whenever someone removes the active one. Thoughts?
The latter seems the Right Way except, I guess, that the decision to
remove the file is buried deep inside inval.c.
Thanks,
Gavin
On Tue, 2006-10-17 at 22:29 -0400, Tom Lane wrote:
Dirk Lutzebaeck and I just spent a tense couple of hours trying to
figure out why a large database Down Under wasn't coming up after being
reloaded from a base backup plus PITR recovery. The symptoms were that
the recovery went fine, but backend processes would fail at startup or
soon after with "could not open relation XX/XX/XX: No such file" type of
errors.
Understand the tension...
The answer that ultimately emerged was that they'd been running a
nightly maintenance script that did REINDEX SYSTEM (among other things
I suppose). The PITR base backup included pg_internal.init files that
were appropriate when it was taken, and the PITR recovery process did
nothing whatsoever to update 'em :-(. So incoming backends picked up
init files with obsolete relfilenode values.
OK, I'm looking at this now for later discussion.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-10-18 at 12:49 +1000, Gavin Sherry wrote:
We don't actually need to *update* the file, per se, we only need to
remove it if no longer valid --- the next incoming backend will rebuild
it. I could see fixing this by making WAL recovery run around and zap
all the .init files (only problem is to find 'em), or we could add a new
kind of WAL record saying "remove the .init file for database XYZ"
to be emitted whenever someone removes the active one. Thoughts?
Yes, that assessment seems good.
The latter seems the Right Way except, I guess, that the decision to
remove the file is buried deep inside inval.c.
I'd prefer the zap everything approach, but emitting a WAL record looks
mostly straightforward and just as good.
RelationCacheInitFileInvalidate() can easily emit a WAL record. This is
called twice in succession, so we would emit WAL on the
RelationCacheInitFileInvalidate(true) call only. I'll work out a patch
for that... XLOG_XACT_RELCACHE_INVALIDATE
RelationCacheInitFileInvalidate() is also called on each
FinishPreparedTransaction(). If that is called 100% of the time, then we
can skip writing an additional record for prepared transactions by
triggering the removal of pg_internal.init when we see a
XLOG_XACT_COMMIT_PREPARED during replay.
Not sure whether we need to do that, Heikki? Anyone?
I'm guessing no, but it seems sensible to check.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes:
RelationCacheInitFileInvalidate() is also called on each
FinishPreparedTransaction().
Surely not...
regards, tom lane
On Wed, 2006-10-18 at 13:24 -0400, Tom Lane wrote:
"Simon Riggs" <simon@2ndquadrant.com> writes:
RelationCacheInitFileInvalidate() is also called on each
FinishPreparedTransaction().
Surely not...
I take that to mean there's nothing special about prepared transactions
and invalidating the rel cache, so we *do* need to have a separate WAL
record in all cases.
OK, I'll write up a patch later today (working in US for few days).
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote:
RelationCacheInitFileInvalidate() is also called on each
FinishPreparedTransaction().
It's only called if the prepared transaction invalidated the init file.
If that is called 100% of the time, then we
can skip writing an additional record for prepared transactions by
triggering the removal of pg_internal.init when we see a
XLOG_XACT_COMMIT_PREPARED during replay.
Not sure whether we need to do that, Heikki? Anyone?
I'm guessing no, but it seems sensible to check.
If you write the WAL record in RelationCacheInitFileInvalidate(true),
that's enough. No extra handling for prepared transactions is needed.
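To illustrate why one record per invalidation is sufficient, here is a toy model in Python. It is not PostgreSQL code: the record name, the class, and the function names are all hypothetical, and the boolean argument simply stands in for the first of the two successive calls described above.

```python
from dataclasses import dataclass, field

# Hypothetical record type, used only in this toy model.
INIT_INVALIDATE = "RELCACHE_INIT_INVALIDATE"


@dataclass
class ToyCluster:
    init_files: set = field(default_factory=set)  # database ids that have an init file
    wal: list = field(default_factory=list)       # (record_type, db_id) tuples


def invalidate_init_file(cluster, db_id, first_call):
    """Remove the init file; log the removal only on the first of the two calls."""
    if first_call:
        cluster.wal.append((INIT_INVALIDATE, db_id))
    cluster.init_files.discard(db_id)


def replay(backup_init_files, wal):
    """Start from the init files in the base backup and apply logged removals."""
    restored = set(backup_init_files)
    for record_type, db_id in wal:
        if record_type == INIT_INVALIDATE:
            restored.discard(db_id)
    return restored
```

In this model a stale init file captured by the base backup is dropped during replay however the invalidation was triggered, so prepared transactions need no separate case.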
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, 2006-10-18 at 15:56 +0100, Simon Riggs wrote:
On Tue, 2006-10-17 at 22:29 -0400, Tom Lane wrote:
The answer that ultimately emerged was that they'd been running a
nightly maintenance script that did REINDEX SYSTEM (among other things
I suppose). The PITR base backup included pg_internal.init files that
were appropriate when it was taken, and the PITR recovery process did
nothing whatsoever to update 'em :-(. So incoming backends picked up
init files with obsolete relfilenode values.
OK, I'm looking at this now for later discussion.
I've coded a patch and am just testing now.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com