BUG #5599: Vacuum fails due to index corruption issues
The following bug has been logged online:
Bug reference: 5599
Logged by: Hitesh Bhambhani
Email address: hiteshb@asg.com
PostgreSQL version: 8.2.9-1
Operating system: Microsoft Windows Server 2003, Enterprise Edition
Description: Vacuum fails due to index corruption issues
Details:
Hi,
We are seeing a problem in our application where the table indexes get
corrupted. This application is a Java Webapp with postgre as the backend.
The Webapp kicks off Vacuum at regular intervals. After running the
application for a while, one of our customers noted that the Vacuum fails
and Webapp gets very slow.
A re-index of the full database works fine and resolves the issue. But it
re-occurs within a day.
Based on some logs in the Webapp I can see that there were some errors in
truncating relations. Once those errors disappear the index corruption
errors start. I'm not sure if there is a connection here.
Here is a sample log message from the Webapp that shows the truncate error:
2010-07-08 06:29:54,641 WARN DefaultQuartzScheduler_Worker-4
maintenance.PostgreSqlVacuumer:32 - runVacuumFull(): running VACUUM FULL
VERBOSE
2010-07-08 06:29:56,672 ERROR DefaultQuartzScheduler_Worker-4
core.JobRunShell:211 - Job DEFAULT.postgreSqlVacuumJob threw an unhandled
Exception:
org.springframework.jdbc.BadSqlGrammarException: Hibernate-related JDBC
operation;
bad SQL grammar []; nested exception is org.postgresql.util.PSQLException:
ERROR: could not truncate relation 1663/16403/41274 to 30 blocks: Permission
denied
After a couple of such truncate errors the Vacuum starts failing due to
'failed to re-find parent key in index', as seen in sample error log below:
2010-07-08 06:44:56,060 ERROR DefaultQuartzScheduler_Worker-1
core.JobRunShell:2
11 - Job DEFAULT.postgreSqlVacuumJob threw an unhandled Exception:
org.springframework.jdbc.UncategorizedSQLException: Hibernate-related JDBC
operation;
uncategorized SQLException for SQL []; SQL state [XX000]; error code [0];
ERROR: failed to re-find parent key in index "pmoinstance_idx_pmotypeid" for
deletion target page 30;
nested exception is org.postgresql.util.PSQLException: ERROR: failed to
re-find parent key in index "pmoinstance_idx_pmotypeid" for deletion target
page 30
Here the index "pmoinstance_idx_pmotypeid" is one of several application
specific indexes.
Once this index corruption issue occurs, the Vacuum job keeps failing until
a re-index is done. In the long run, we don't want to keep re-indexing the
database as a scheduled job so we would like your help to get to the bottom
of this.
So how can we avoid these index corruption errors and are there any known
causes?
Also, please let me know if there is a direct link between the truncate
relation errors that I saw preceded the index corruption errors?
At the time these Webapp logs were collected, the database was set to
produce verbose logging so I don't have database logs, sorry. Please do let
me know what other information I can provide to help you diagnose this
situation.
Thanks for your time.
Regards,
Hitesh
On Wed, Aug 4, 2010 at 10:47 PM, Hitesh Bhambhani <hiteshb@asg.com> wrote:
PostgreSQL version: 8.2.9-1
Firstly, the current release of 8.2 is 8.2.17. There are a long list
of bugs fixed in those intervening releases including one involving
vacuum truncating relations. I don't think it's the same problem but I
would recommend updating immediately to 8.2.17.
2010-07-08 06:29:54,641 WARN DefaultQuartzScheduler_Worker-4
maintenance.PostgreSqlVacuumer:32 - runVacuumFull(): running VACUUM FULL
VERBOSE
Secondly we don't recommend running VACUUM FULL routinely. It should
only be necessary in extraordinary circumstances. Normally a plain
VACUUM (or VACUUM ANALYZE or VACUUM VERBOSE) should be sufficient as
long as it's being run regularly. Regular VACUUM without the "FULL"
has much less impact on the system.
2010-07-08 06:29:56,672 ERROR DefaultQuartzScheduler_Worker-4
core.JobRunShell:211 - Job DEFAULT.postgreSqlVacuumJob threw an unhandled
Exception:
org.springframework.jdbc.BadSqlGrammarException: Hibernate-related JDBC
operation;
bad SQL grammar []; nested exception is org.postgresql.util.PSQLException:
ERROR: could not truncate relation 1663/16403/41274 to 30 blocks: Permission
denied
"Permission denied" smells like a Windows problem with concurrent file
operations. Are you sure you're not running any anti-virus software or
backup software which could have these files open and prevent Postgres
from performing regular file operations on its files? Many people have
reported other problems with anti-virus software in particular.
Also, please let me know if there is a direct link between the truncate
relation errors that I saw preceded the index corruption errors?
I'm a bit surprised at that, but I haven't checked what exactly
happens when an error occurs truncating files myself so I'm not sure
if I should be or not.
--
greg
Excerpts from Hitesh Bhambhani's message of mié ago 04 17:47:12 -0400 2010:
Based on some logs in the Webapp I can see that there were some errors in
truncating relations. Once those errors disappear the index corruption
errors start. I'm not sure if there is a connection here.
There probably is. What kind of relation are the ones unable to
truncate? Please see in pg_class where relfilenode = '41274' in this
case:
bad SQL grammar []; nested exception is org.postgresql.util.PSQLException:
ERROR: could not truncate relation 1663/16403/41274 to 30 blocks: Permission
denied
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Greg, thanks for your answers.
My comments below...
From: gsstark@gmail.com [mailto:gsstark@gmail.com] On Behalf Of Greg Stark
Sent: Wednesday, August 04, 2010 9:35 PM
Firstly, the current release of 8.2 is 8.2.17. There are a long list of bugs
fixed in those intervening releases including one involving vacuum truncating
relations. I don't think it's the same problem but I would recommend updating
immediately to 8.2.17.
[HiteshB] I have noted your recommendation and will work with our Product Management to upgrade to the latest and greatest. Although we can't change the version that the customer has installed (8.2.9-1).
Secondly we don't recommend running VACUUM FULL routinely. It should only be
necessary in extraordinary circumstances. Normally a plain VACUUM (or VACUUM
ANALYZE or VACUUM VERBOSE) should be sufficient as long as it's being run
regularly. Regular VACUUM without the "FULL"
has much less impact on the system.
[HiteshB] Point noted. The decision to use VACUUM FULL is something we'll re-examine.
Could you give an example of what an 'extraordinary circumstance' would be?
"Permission denied" smells like a Windows problem with concurrent file
operations. Are you sure you're not running any anti-virus software or backup
software which could have these files open and prevent Postgres from
performing regular file operations on its files? Many people have reported
other problems with anti-virus software in particular.
[HiteshB] We have worked with the customer to exclude the postgre directory from their antivirus scans. Hopefully we won't see this Permission issue again.
Thanks again.
Regards,
Hitesh
Alvaro, thanks for your response.
My comments below...
From: Alvaro Herrera [mailto:alvherre@commandprompt.com]
Sent: Wednesday, August 04, 2010 11:30 PM
There probably is. What kind of relation are the ones unable to truncate?
Please see in pg_class where relfilenode = '41274' in this
case:
[HiteshB] the relation is called pmoinstance_idx_pmotypeid. This is an index on a table created by our product.
The definition for this index is:
CREATE INDEX pmoinstance_idx_pmotypeid ON pmoinstances USING btree (pmotype_id);
Regards,
Hitesh
On Thu, Aug 5, 2010 at 12:55 PM, Hitesh Bhambhani
<hitesh.bhambhani@asg.com> wrote:
[HiteshB] I have noted your recommendation and will work with our Product Management to upgrade to the latest and greatest. Although we can't change the version that the customer has installed (8.2.9-1).
The latest and greatest is 8.4.x which has lots of new features
including improvements to VACUUM. 9.0 will be out soon and will
replace VACUUM FULL entirely too.
However 8.2.17 is the same as the version you're using except with
dozens of known bugs fixed and security holes patched.
--
greg
Hitesh Bhambhani <hitesh.bhambhani@asg.com> wrote:
Could you give an example of what an 'extraordinary circumstance'
would be?
Normal vacuums will remove old tuples (versions of rows) which can
no longer be seen by any transaction, and make that space available
for re-use within the PostgreSQL files. It will not normally give
space back to the OS, but that's usually a *good* thing, because
normally the space will soon be needed again by PostgreSQL, and it
would be less efficient to constantly be giving space back and
allocating it again.
If you neglect to vacuum aggressively enough, or do a mass UPDATE or
DELETE which affects a large percentage of your rows, without
anticipating that you will need that space again soon, you might
want to do aggressive maintenance to shrink the PostgreSQL files.
VACUUM FULL will move tuples around within the table to free up
space at the end so that it can be released. But wait -- you
probably *still* don't want to use VACUUM FULL, because it is *very*
slow and will bloat your indexes, requiring a REINDEX to restore
decent performance. CLUSTER will rewrite the table without dead
space and will rebuild the indexes -- usually much faster than
VACUUM FULL. But CLUSTER needs room for a second copy of the table
in order to copy it. If you have a very bloated table which you
want to shrink and you don't have room for a second copy of it,
*that* is the time to consider VACUUM FULL (usually followed by
REINDEX).
If you ever find you *do* need to run VACUUM FULL, you probably need
to re-evaluate your maintenance procedures to see how you can avoid
having to do it again.
-Kevin
Hitesh Bhambhani <hitesh.bhambhani@asg.com> writes:
From: Alvaro Herrera [mailto:alvherre@commandprompt.com]
Sent: Wednesday, August 04, 2010 11:30 PM
There probably is. What kind of relation are the ones unable to truncate?
Please see in pg_class where relfilenode = '41274' in this
case:[HiteshB] the relation is called pmoinstance_idx_pmotypeid.
So the relation that couldn't be truncated is indeed the same one
complained of in the later message.
What it looks like to me is that page 30 was deleted, and then vacuum
tried to truncate it off the index altogether, and that failed because
of Windows randomness, and then later it's trying to delete page 30
again. Which naturally fails because the parent downlink entry is long
gone. But it's odd that it tries to delete page 30 twice. The page
should have been in BTP_DELETED state before the truncate attempt, and
therefore should still be dead later, so why's it trying again?
[ thinks for a bit ... ] I do see a way that could happen. The page
could still be live on disk when we reach smgrtruncate(): the update
to BTP_DELETED state might only exist in a dirty shared buffer. And
lookee here what smgrtruncate does:
/*
* Get rid of any buffers for the about-to-be-deleted blocks. bufmgr will
* just drop them without bothering to write the contents.
*/
DropRelFileNodeBuffers(reln->smgr_rnode, forknum, isTemp, nblocks);
So we throw away the BTP_DELETED state update without ever writing it to
disk, and then when the truncate fails, the old page state is still out
there, ready to confuse us later.
Seems like we need to think harder about recovering from a truncate
failure. A few random ideas:
1. Write the dirty buffers before dropping them. Kind of ugly from a
performance viewpoint, but simple and safe.
2. Treat truncation failure as a PANIC condition; then WAL replay will
fix things. Unpleasant. Actually, worse than unpleasant: I think what
the comment in RelationTruncate() is pointing out is that if the
failure is repeatable, we'd fail to recover at all.
3. Don't discard the dirty buffers until after successfully truncating.
The hard part here is to be sure no other process (like bgwriter) will
try to write them in between. I don't see a way to guarantee that,
at least not without interlock infrastructure that doesn't exist today.
And see also that comment in RelationTruncate(). Seems like the whole
problem of coping with truncation failure needs more thought than we've
given it.
regards, tom lane
Excerpts from Tom Lane's message of jue ago 05 11:06:57 -0400 2010:
Seems like we need to think harder about recovering from a truncate
failure. A few random ideas:
Ugh.
1. Write the dirty buffers before dropping them. Kind of ugly from a
performance viewpoint, but simple and safe.
I think "simple" is good, considering that this code is gone in 9.0 and
HEAD. IMHO investing too much effort on this problem is not worth it.
Perhaps it'd be good to come up with a better solution for HEAD:
3. Don't discard the dirty buffers until after successfully truncating.
The hard part here is to be sure no other process (like bgwriter) will
try to write them in between. I don't see a way to guarantee that,
at least not without interlock infrastructure that doesn't exist today.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of jue ago 05 11:06:57 -0400 2010:
1. Write the dirty buffers before dropping them. Kind of ugly from a
performance viewpoint, but simple and safe.
I think "simple" is good, considering that this code is gone in 9.0 and
HEAD. IMHO investing too much effort on this problem is not worth it.
Gone? Looks like it's still there to me.
Perhaps it'd be good to come up with a better solution for HEAD:
Yeah, the panic-on-replay aspect is troublesome. I feel like we need a
rethink here. But I agree that solution #1 is the only one that feels
safe enough for backpatching.
regards, tom lane
Excerpts from Tom Lane's message of jue ago 05 12:36:24 -0400 2010:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of jue ago 05 11:06:57 -0400 2010:
1. Write the dirty buffers before dropping them. Kind of ugly from a
performance viewpoint, but simple and safe.I think "simple" is good, considering that this code is gone in 9.0 and
HEAD. IMHO investing too much effort on this problem is not worth it.Gone? Looks like it's still there to me.
I mean the btree code that does the truncation on vacuum full is
truncated. There are other uses for truncation, but it doesn't look to
that they are as problematic ... or are they?
Hmm, I guess truncation of heap on lazy vacuum is still a problem
precisely because page compaction will be forgotten.
Perhaps it'd be good to come up with a better solution for HEAD:
Yeah, the panic-on-replay aspect is troublesome. I feel like we need a
rethink here. But I agree that solution #1 is the only one that feels
safe enough for backpatching.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of jue ago 05 12:36:24 -0400 2010:
Gone? Looks like it's still there to me.
I mean the btree code that does the truncation on vacuum full is
truncated. There are other uses for truncation, but it doesn't look to
that they are as problematic ... or are they?
I think what Hitesh reported is a special case of a generic problem.
Hmm, I guess truncation of heap on lazy vacuum is still a problem
precisely because page compaction will be forgotten.
Page compaction is the least of it :-(
Imagine that we have some rows at the end of a table, we delete them,
we vacuum before the next checkpoint. Vacuum decides it can now
truncate away the last pages, but fails to do so. The original page
state is still on disk, which means we have lost the fact of the
deletion --- the rows are now effectively live again, though their
index entries are probably gone.
In any case, the removal of VACUUM FULL didn't completely disable
shrinking of btree indexes did it? I don't recall having removed
that.
regards, tom lane
Excerpts from Tom Lane's message of jue ago 05 13:19:41 -0400 2010:
In any case, the removal of VACUUM FULL didn't completely disable
shrinking of btree indexes did it? I don't recall having removed
that.
I see no call to RelationTruncate in the btvacuumscan code, but then it
was only called in vacuum full before. I'm not sure how it worked
previously with only lazy vacuum. Did we simply leave the pages as free
for a later extension?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of jue ago 05 13:19:41 -0400 2010:
In any case, the removal of VACUUM FULL didn't completely disable
shrinking of btree indexes did it? I don't recall having removed
that.
I see no call to RelationTruncate in the btvacuumscan code, but then it
was only called in vacuum full before. I'm not sure how it worked
previously with only lazy vacuum. Did we simply leave the pages as free
for a later extension?
You're right, I misremembered. That code is just plain gone in 9.0:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/nbtree/nbtree.c.diff?r1=1.174;r2=1.175;f=h
Still, we have a live issue with heap truncation during plain VACUUM.
However, the scope of the problem seems a lot less than I was thinking.
Maybe write-the-buffers-first is a sufficient longterm solution.
regards, tom lane
Excerpts from Tom Lane's message of jue ago 05 14:01:15 -0400 2010:
You're right, I misremembered. That code is just plain gone in 9.0:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/nbtree/nbtree.c.diff?r1=1.174;r2=1.175;f=hStill, we have a live issue with heap truncation during plain VACUUM.
However, the scope of the problem seems a lot less than I was thinking.
The scope is further reduced by the fact that this only seems to happen
on Windows, and then only when the antivirus is messing around with the
files.
Maybe write-the-buffers-first is a sufficient longterm solution.
Yeah, perhaps it is, though it's a pity that a single platform problem
is going to slow down everyone else.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 05/08/10 21:28, Alvaro Herrera wrote:
Excerpts from Tom Lane's message of jue ago 05 14:01:15 -0400 2010:
Maybe write-the-buffers-first is a sufficient longterm solution.
Yeah, perhaps it is, though it's a pity that a single platform problem
is going to slow down everyone else.
How about performing the truncate as usual, but if it fails, fill the
truncated portion with zeros instead? Zeroed pages should be handled
gracefully, and this would be a very non-invasive fix. Now if the
write() to zero the pages fails too, I guess we'll have to panic, but
that's not much different from flushing the dirty pages first.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 05/08/10 21:28, Alvaro Herrera wrote:
Excerpts from Tom Lane's message of jue ago 05 14:01:15 -0400 2010:
Maybe write-the-buffers-first is a sufficient longterm solution.
Yeah, perhaps it is, though it's a pity that a single platform problem
is going to slow down everyone else.
How about performing the truncate as usual, but if it fails, fill the
truncated portion with zeros instead?
This is blithely assuming that you're going to be able to do anything at
all to the file after having failed to truncate it. The permissions
problem might be persistent, or you might crash, or several other
things. If we could do this safely, we could also do the original #3
solution (write the dirty buffers for the to-be-truncated block range
only after failing to truncate).
regards, tom lane
On Thu, Aug 5, 2010 at 7:28 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
The scope is further reduced by the fact that this only seems to happen
on Windows, and then only when the antivirus is messing around with the
files.
So I suspect this could be triggered lots of ways. Imagine a ZFS
volume that runs out of space temporarily. Even truncate would need
additional blocks to replace the meta information. Or a network
filesystem like AFS where your kerberos tickets have expired and you
get a permission failure until they've been renewed. Or, in the case
of a very large table being truncated I suspect there's a
CHECK_FOR_INTERRUPTS lying around that can cancel the backend at the
wrong time.
It would be nice to have a regression test harness that caused system
calls to fail randomly -- the difficult part would be testing the
results.
--
greg
In the discussion of bug #5599 we pretty much agreed to do this:
Seems like we need to think harder about recovering from a truncate
failure. A few random ideas:
1. Write the dirty buffers before dropping them. Kind of ugly from a
performance viewpoint, but simple and safe.
I looked at making this happen, and noted that DropRelFileNodeBuffers
is used both for the truncation case and for dropping relation buffers
during smgrdounlink. In the latter case, it's still appropriate to
drop dirty buffers without writing them, both for performance reasons
and because we don't really care about any errors: we have already
committed the relation DROP, and are not going to look at the file
contents again in any case. So this means that two different behaviors
are now required for truncation and dropping.
The cleanest fix is an API change to add a boolean write-or-not
parameter to DropRelFileNodeBuffers. That's what I want to do in HEAD
and 9.0, but I'm unsure whether it's a safe fix in the back branches.
Does anyone have an opinion whether it's likely that any third-party
code is calling DropRelFileNodeBuffers directly? If there is, then
changing its API in a minor release would be an unfriendly thing to do.
We could avoid that by some ugly expedient like inserting a second copy
of the function in back branches.
Comments?
regards, tom lane
On 15/08/10 21:58, Tom Lane wrote:
Does anyone have an opinion whether it's likely that any third-party
code is calling DropRelFileNodeBuffers directly?
I doubt it. External modules shouldn't be modifying relations at such a
low level.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com