cannot delete corrupted rows after DB corruption: tuple concurrently updated

Started by john gale · about 12 years ago · 4 messages · general
#1john gale
john@smadness.com

We ran into an open file limit on the DB host (Mac OS X 10.9.0, Postgres 9.3.2), which caused the familiar "ERROR: unexpected chunk number 0 (expected 1) for toast value 155900302 in pg_toast_16822" when selecting data.

Previously when we've run into this kind of corruption, we could find the specific corrupted rows in the table and delete them by ctid. However, this time we're running into a persistent "ERROR: tuple concurrently updated" when deleting by ctid.

munin2=# select ctid from testruns where id = 141889653;
ctid
--------------
(37069816,3)
(1 row)

munin2=# delete from testruns where ctid = '(37069816,3)';
ERROR: tuple concurrently updated

This always occurs and seems to prevent us from cleaning up the database by removing the corrupted rows.

Before attempting anything more drastic, like restarting the postgres instance: is there a known way of getting around this error and cleaning up the corruption (other than the full replicate / reindex suggestions from around the web, which are more involved than deleting corrupted rows by ctid)?
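
For reference, a row-by-row scan along these lines is one way to locate the broken rows (a sketch, not exactly what we ran; casting each row to text forces detoasting of its out-of-line values):

DO $$
DECLARE
    rec record;
BEGIN
    FOR rec IN SELECT ctid, id FROM testruns LOOP
        BEGIN
            -- reading the whole row as text detoasts every column
            PERFORM length(t::text) FROM testruns t WHERE t.ctid = rec.ctid;
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'bad row at ctid %, id %: %', rec.ctid, rec.id, SQLERRM;
        END;
    END LOOP;
END $$;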

thanks,

~ john

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2john gale
john@smadness.com
In reply to: john gale (#1)
Re: cannot delete corrupted rows after DB corruption: tuple concurrently updated

Does anybody have any ideas about this?

We restarted the postmaster and the issue persists. So whereas in 9.0.4 we could clean up corruption this way, in 9.3.2 it seems we no longer can. I'm assuming this because our data-insert environment has not changed, so we shouldn't be hitting any different transaction concurrency / isolation problems than we did before.

Is there a way to force deletion of a row, ignoring concurrency checks? It looks like changing default_transaction_isolation did not affect this:

munin2=# delete from testruns where ctid = '(37069305,4)';
ERROR: tuple concurrently updated

2014-02-26 07:42:46 GMT LOG: received SIGHUP, reloading configuration files
2014-02-26 07:42:46 GMT LOG: parameter "default_transaction_isolation" changed to "read uncommitted"
2014-02-26 07:42:53 GMT ERROR: tuple concurrently updated
2014-02-26 07:42:53 GMT STATEMENT: delete from testruns where ctid = '(37069305,4)';

thanks,

~ john


#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: john gale (#2)
Re: cannot delete corrupted rows after DB corruption: tuple concurrently updated

On 26 Únor 2014, 8:45, john gale wrote:

Does anybody have any ideas about this.

We restarted the postmaster and the issue persists. So whereas in 9.0.4 we
could clean up corruption this way, in 9.3.2 it seems we no longer can. I'm
assuming this because our data-insert environment has not changed, so we
shouldn't be hitting any different transaction concurrency / isolation
problems than we did before.

Is there a way to force deletion of a row, ignoring concurrency, similar
to concurrent updates. It looks like changing
default_transaction_isolation did not affect this:

munin2=# delete from testruns where ctid = '(37069305,4)';
ERROR: tuple concurrently updated

AFAIK this error is raised when a before trigger modifies the row that is
being deleted. Imagine a trigger that does this

UPDATE testruns SET mycolumn = 1 WHERE id = OLD.id;
RETURN OLD;

Given the way MVCC in postgres works (copying row when updating), the
error makes sense. In 9.0 this worked by silently skipping the DELETE
(incidentally, I had a few reports about tables that can't be deleted
because of this in the past month).
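
A minimal illustration of that shape of trigger, on a hypothetical table t (a sketch, untested):

CREATE TABLE t (id int PRIMARY KEY, val int);
INSERT INTO t VALUES (1, 0);

CREATE FUNCTION touch_row() RETURNS trigger AS $$
BEGIN
    -- modifies the very row the DELETE is about to remove
    UPDATE t SET val = val + 1 WHERE id = OLD.id;
    RETURN OLD;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER t_touch BEFORE DELETE ON t
    FOR EACH ROW EXECUTE PROCEDURE touch_row();

DELETE FROM t WHERE id = 1;   -- on 9.3 this should raise the error above;
                              -- on 9.0 the DELETE was silently skipped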

Anyway, do you have any triggers on the table? If yes, try to disable
them. I suspect the data are corrupted in a way that causes update on the
deleted row - either directly, or maybe because of a cascading effect.
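
For example (DISABLE TRIGGER USER covers user-defined triggers only; ALL also disables the internal RI triggers and requires superuser):

ALTER TABLE testruns DISABLE TRIGGER ALL;
DELETE FROM testruns WHERE ctid = '(37069305,4)';
ALTER TABLE testruns ENABLE TRIGGER ALL;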

I'm wondering if it might be caused by RI triggers - maybe yes, but I'm
not aware of any RI trigger doing updates.

That being said, I think what you're doing is wrong. If you think you have a
corrupted database, I'd strongly suggest doing a dump/restore. Otherwise, how
do you know there's no other corruption lurking in the files, slowly
spreading to other parts of the database?
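
Roughly like this (a sketch; database and file names are placeholders, and the dump will error out on rows it cannot read, so the broken rows need to be dealt with or excluded first):

$ pg_dump -Fc munin2 > munin2.dump
$ createdb munin2_new
$ pg_restore -d munin2_new munin2.dump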

Tomas


#4john gale
john@smadness.com
In reply to: Tomas Vondra (#3)
Re: cannot delete corrupted rows after DB corruption: tuple concurrently updated

On Feb 26, 2014, at 2:59 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

On 26 Únor 2014, 8:45, john gale wrote:

munin2=# delete from testruns where ctid = '(37069305,4)';
ERROR: tuple concurrently updated

AFAIK this error is raised when a before trigger modifies the row that is
being deleted. Imagine a trigger that does this

UPDATE testruns SET mycolumn = 1 WHERE id = OLD.id;
RETURN OLD;

Given the way MVCC in postgres works (copying row when updating), the
error makes sense. In 9.0 this worked by silently skipping the DELETE
(incidentally, I had a few reports about tables that can't be deleted
because of this in the past month).

Anyway, do you have any triggers on the table? If yes, try to disable
them. I suspect the data are corrupted in a way that causes update on the
deleted row - either directly, or maybe because of a cascading effect.

There were a few triggers auto-created by a foreign key constraint, but we removed the constraint, which removed the triggers:

munin2=# select * from pg_trigger;
tgrelid | tgname | tgfoid | tgtype | tgenabled | tgisinternal | tgconstrrelid | tgconstrindid | tgconstraint | tgdeferrable | tginitdeferred | tgnargs | tgattr | tgargs | tgqual
---------+--------+--------+--------+-----------+--------------+---------------+---------------+--------------+--------------+----------------+---------+--------+--------+--------
(0 rows)

The tuple error still exists, however:

munin2=# delete from testruns where ctid = '(37069305,4)';
ERROR: tuple concurrently updated
munin2=# select id from testruns where ctid = '(37069305,4)';
id
-----------
141908486
(1 row)

munin2=# delete from testruns where id = 141908486;
ERROR: tuple concurrently updated
munin2=# select * from testruns where id = 141908486;
ERROR: unexpected chunk number 0 (expected 1) for toast value 155900302 in pg_toast_16822

I'm wondering if it might be caused by RI triggers - maybe yes, but I'm
not aware of any RI trigger doing updates.

That being said, I think what you're doing is wrong. If you think you have a
corrupted database, I'd strongly suggest doing a dump/restore. Otherwise, how
do you know there's no other corruption lurking in the files, slowly
spreading to other parts of the database?

We're aware that we're shoveling dirt to patch a crack in a large highway. However, at the moment we value uptime over strict integrity of the data (141 million rows allows some wiggle room), and since we don't modify these rows after they're inserted, I can't imagine how this kind of row corruption could "slowly spread to other parts of the database".

~ john