BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342
The following bug has been logged online:
Bug reference: 5929
Logged by: Tambet Matiisen
Email address: tambet.matiisen@gmail.com
PostgreSQL version: 8.4.7
Operating system: PostgreSQL 8.4.7 on i486-pc-linux-gnu, compiled by GCC
gcc-4.4.real (Debian 4.4.5-10) 4.4.5, 32-bit
Description: ERROR: found toasted toast chunk for toast value
260340218 in pg_toast_260339342
Details:
For a few days I've been getting this error from my nightly backup script:
Warning: pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342
pg_dump: The command was: COPY public.yhistud_urlcache (id, url, params, sess_id, content) TO stdout;
pg_dumpall: pg_dump failed on database "yhistud", exiting
Warning: Failed to dump pgsql cluster
Yesterday I upgraded the server from 8.4.5 to 8.4.7, hoping that this error
would go away, but with no success.
I've been getting occasional errors from the backup script for several
months, but this time the error is recurring. I have upgraded the Linux
kernel to 2.6.32, hoping that the problem might be in the software RAID
driver, but nothing changed; occasionally I still get errors. I still have
to run a memory test on the server, but I doubt faulty memory is the
problem, because otherwise the server behaves well.
"Tambet Matiisen" <tambet.matiisen@gmail.com> wrote:
For a few days I've been getting this error from my nightly backup
script:
Warning: pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342
pg_dump: The command was: COPY public.yhistud_urlcache (id, url, params, sess_id, content) TO stdout;
pg_dumpall: pg_dump failed on database "yhistud", exiting
Warning: Failed to dump pgsql cluster
So you don't have a current backup, and your database is corrupted.
(1) If you still have a backup from before you started getting
backup failures, keep it safe until everything has settled down and
is running well for several months.
(2) Stop PostgreSQL and do a full copy of the data directory and
everything under it to a backup medium or another machine. Keep
this copy safe for months, too.
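Kevin's step (2) could be sketched as a short shell script; the Debian data-directory path, the backup destination, and the use of pg_ctlcluster (Debian's wrapper for starting and stopping a cluster) are illustrative assumptions, not commands from this thread.

```shell
#!/bin/sh
# Sketch of the cold file-level copy in step (2), assuming a Debian-packaged
# 8.4 cluster. PGDATA and DEST are illustrative paths, not from the thread.
# The function is only defined here, not run, since it stops the server.
PGDATA=/var/lib/postgresql/8.4/main
DEST=/mnt/backup/pgdata-$(date +%Y%m%d)

cold_copy() {
    # The server must be fully stopped; copying the files of a running
    # cluster does not yield a consistent backup.
    pg_ctlcluster 8.4 main stop
    cp -a "$PGDATA" "$DEST"    # -a preserves ownership, permissions, symlinks
    pg_ctlcluster 8.4 main start
}
```

The point of the file-level copy is forensic: it preserves the corrupted TOAST pages exactly as they are on disk, which a logical dump cannot do.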
Yesterday I upgraded the server from 8.4.5 to 8.4.7, hoping that
this error would go away, but with no success.
Newer versions with more bug fixes may be less likely to contain
bugs which could cause corruption, but an upgrade like that is
unlikely to "heal" data which is already corrupted.
I've been getting occasional errors from the backup script for
several months,
Do you know what those were?
I have upgraded the Linux kernel to 2.6.32, hoping that maybe the
problem is in the software RAID driver, but nothing changed;
occasionally I still get errors.
Occasionally get what errors?
I still have to run a memory test on the server, but I doubt
faulty memory is the problem, because otherwise the server behaves
well.
So, no problems other than months of errors on backups? Never any
OS lockups, power losses, or other abrupt terminations of
operations?
Also, do you now or have you ever run the database with fsync = off
or full_page_writes = off?
It is very important to figure out how your data got corrupted;
otherwise you can't really trust this machine.
-Kevin
Hi Kevin!
Thanks for your reply. You make this sound more serious than I thought.
This is a development server that is also used as a pre-live server. The
pre-live database is restored from a dump of the live database every
night. So far the errors have been in the pre-live database, which is why
I didn't worry too much - it is overwritten from backup every night
anyway. Usually the error was gone the next day. I mostly blamed badly
timed backup and restore scripts, although that shouldn't result in
errors.
The errors started on 07.09.2010, when I was still running PostgreSQL
8.1. A few examples:
07.09.2010:
Warning: pg_dump: ERROR: could not open relation with OID 339815468
pg_dump: SQL command to dump the contents of table "kannete_read" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: could not open relation with OID 339815468
pg_dump: The command was: COPY public.kannete_read (yhistu_id, kande_rea_id, kande_id, konto_nr, alamkonto_nr, deebetsumma, kreeditsumma, deebetsaldo, kreeditsaldo, alamkonto_deebetsaldo, alamkonto_kreeditsaldo, looja, loomise_aeg, muutja, muutmise_aeg, kuupaev, kande_nr, kinnitatud, deebetprotsent, kreeditprotsent) TO stdout;
pg_dumpall: pg_dump failed on database "korteriy_histu", exiting
19.09.2010:
Warning: pg_dump: ERROR: unexpected chunk number 926884437 (expected 514) for toast value 1736426835
pg_dump: SQL command to dump the contents of table "failid" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: unexpected chunk number 926884437 (expected 514) for toast value 1736426835
pg_dump: The command was: COPY public.failid (faili_id, yhistu_id, perioodi_id, arve_id, dokumendi_id, tyyp, sisu, laius, korgus, pikkus, faili_nimi, sisu_tyyp, looja, loomise_aeg, muutja, muutmise_aeg) TO stdout;
pg_dumpall: pg_dump failed on database "arvetest", exiting
24.09.2010:
Warning: pg_dump: socket not open
pg_dump: SQL command to dump the contents of table "failid" failed: PQendcopy() failed.
pg_dump: Error message from server: socket not open
pg_dump: The command was: COPY public.failid (faili_id, yhistu_id, perioodi_id, arve_id, dokumendi_id, tyyp, sisu, laius, korgus, pikkus, faili_nimi, sisu_tyyp, looja, loomise_aeg, muutja, muutmise_aeg) TO stdout;
pg_dumpall: pg_dump failed on database "arvetest", exiting
9.11.2010:
Warning: pg_dump: Dumping the contents of table "maaramised" failed: PQgetCopyData() failed.
pg_dump: Error message from server: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
pg_dump: The command was: COPY public.maaramised (maaramise_id, kululiigi_id, perioodi_id, yhistu_id, korteri_id, kogus, yhik, hind, summa, looja, loomise_aeg, muutja, muutmise_aeg) TO stdout;
pg_dumpall: pg_dump failed on database "arvetest", exiting
More recently, after I upgraded to 8.4, on 11.02.2011:
Warning: pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: compressed data is corrupt
pg_dump: The command was: COPY public.failid (faili_id, yhistu_id, perioodi_id, arve_id, dokumendi_id, tyyp, sisu, laius, korgus, pikkus, faili_nimi, sisu_tyyp, looja, loomise_aeg, muutja, muutmise_aeg) TO stdout;
pg_dumpall: pg_dump failed on database "korteriy_histu", exiting
The current error has occurred three days in a row (13-15.03.2011):
Warning: pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342
pg_dump: The command was: COPY public.yhistud_urlcache (id, url, params, sess_id, content) TO stdout;
pg_dumpall: pg_dump failed on database "yhistud", exiting
This time the error is not in the pre-live database, and therefore it
doesn't go away.
I have not noticed any unusual errors in other services. The server also
runs Subversion, Trac, Apache, Samba, MySQL, Oracle, Tomcat and so on.
PostgreSQL, Subversion, Trac and Apache+PHP are used actively every day.
Both fsync and full_page_writes are on. OK, I don't have a UPS for this
machine, but power has been stable. The current uptime is 32 days, which I
bet dates from the last kernel update. I run Debian testing on that machine.
Currently I blame either faulty memory or a faulty software RAID driver. I
can easily rule out the memory by running memtest86 for a few hours. But
how do I rule out the software RAID driver? PostgreSQL has always been
solid for me, so I suspect it least, but you never know...
Now, off to buy a UPS...
Tambet
Tambet Matiisen <tambet.matiisen@gmail.com> wrote:
Pre-live database is restored from live database dump every night.
How is that done? A single pg_dump of the entire live database
restored using psql? Are both database servers at the same
PostgreSQL version?
So far the errors have been in pre-live database,
You're running pg_dump against a database you just restored from a
pg_dump image?
Usually the next day error was gone. I mostly blamed badly timed
backup and restore scripts, although this shouldn't result in
errors.
No it shouldn't -- if you're following any of the documented backup
and restore techniques. I have a suspicion that you're just doing a
file copy without stopping the live database or properly following
the documented PITR backup and recovery techniques.
The errors started on 07.09.2010, when I was still running
PostgreSQL 8.1. A few examples:
07.09.2010:
Warning: pg_dump: ERROR: could not open relation with OID
339815468
[additional errors which could be caused by copying a database
while running without proper PITR techniques]
The current error has occurred 3 days in a row - 13-15.03.2011:
Warning: pg_dump: SQL command failed pg_dump: Error message from
server:
ERROR: found toasted toast chunk for toast value 260340218 in
pg_toast_260339342
This time the error is not in the pre-live database and therefore it
doesn't go away.
If I understand you, this sounds like corruption in the live
database; nothing on the pre-live database is part of causing this
problem.
The server is also running [...] Samba [...]
I hope you're not trusting Samba too far. For a while we were using
it in backups across our WAN, and it mangled at least one file
almost every day. We had to take to running md5sum against both
ends for each file to ensure we didn't get garbage (until we
converted everything to use TCP communications, which have never
mangled anything for us).
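Kevin's per-file md5sum check can be sketched as below; a local copy stands in for the file after transfer, since the actual remote paths are not given in the thread.

```shell
#!/bin/sh
# Sketch of verifying a transferred dump with md5sum on "both ends".
# A local copy stands in for the remote side; real paths would differ.
src=$(mktemp) && dst=$(mktemp)
echo "pg_dump output would be here" > "$src"
cp "$src" "$dst"                     # stands in for the network transfer

# Keep only the checksum field from md5sum's "HASH  FILENAME" output.
src_sum=$(md5sum "$src" | cut -d' ' -f1)
dst_sum=$(md5sum "$dst" | cut -d' ' -f1)

# Any mismatch means the transfer mangled the file and it must be re-sent.
if [ "$src_sum" = "$dst_sum" ]; then
    status=ok
else
    status=corrupt
fi
rm -f "$src" "$dst"
echo "$status"
```

In a real setup the checksum would be computed independently on the sending and receiving machines and compared before the received file is trusted.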
Both fsync and full_page_writes are on.
Good. Without those an OS or hardware crash can corrupt your
database.
OK, I don't have UPS for this machine, but power has been stable.
Current uptime is 32 days, which I bet is from the last kernel
update.
OK. A power outage wouldn't be too likely to matter if you have
fsync and full_page_writes on.
Currently I blame either faulty memory or a faulty software RAID
driver. I can easily rule out the memory by running
memtest86 for a few hours
Is this ECC memory? If not, even a good test doesn't prove that a
RAM problem didn't cause the corruption.
Now, off to buy a UPS...
Not a bad idea, but it doesn't sound like lack of that is likely to
have caused the corruption in your live database, based on the
settings you mentioned. (Assuming those settings are in use on the
live server.)
-Kevin
On 16.03.2011 17:09, Kevin Grittner wrote:
Tambet Matiisen<tambet.matiisen@gmail.com> wrote:
Pre-live database is restored from live database dump every night.
How is that done? A single pg_dump of the entire live database
restored using psql? Are both database servers at the same
PostgreSQL version?
Yes, I use pg_dump on the live server and the result is rdiff-backupped
onto the development server. The whole SQL dump is 12 GB without
compression, and the rdiff delta is about 10-20 MB every day. Then I drop
the pre-live database on the development server and recreate it using
createdb and psql.
For a while the development server was running 8.4 and the live server
8.1. Now both are 8.4, but this shouldn't matter, as I do backup and
restore via SQL.
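The nightly cycle described here might look roughly like the following pair of cron jobs. This is a hedged sketch: the database name (only yhistud appears in the thread), the postgres user, the devserver host, and all file paths are assumptions, not the actual scripts.

```shell
#!/bin/sh
# Rough sketch of the nightly live -> pre-live cycle as described.
# Hosts, users, database names, and file paths are assumptions.

dump_live() {                  # cron job on the live server
    # Plain-SQL dump; rdiff-backup then ships only the 10-20 MB delta
    # of the ~12 GB dump file to the development server.
    pg_dump -U postgres yhistud > /backup/yhistud.sql
    rdiff-backup /backup/ devserver::/srv/pg-dumps/
}

restore_prelive() {            # cron job on the development server
    # Drop and rebuild the pre-live database from the shipped SQL file.
    dropdb   -U postgres yhistud
    createdb -U postgres yhistud
    psql -U postgres -q -d yhistud -f /srv/pg-dumps/yhistud.sql
}
```

Because everything crosses the wire as plain SQL, on-disk corruption on one server cannot propagate to the other at the file level, which is the point Tambet makes later in the thread.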
So far the errors have been in pre-live database,
You're running pg_dump against a database you just restored from a
pg_dump image?
Hmm, yeah. This sounds rather dumb, but I haven't gotten around to
changing that yet. The development server also contains some additional
databases that do not exist on the live server.
Usually the error was gone the next day. I mostly blamed badly timed
backup and restore scripts, although this shouldn't result in
errors.
No it shouldn't -- if you're following any of the documented backup
and restore techniques. I have a suspicion that you're just doing a
file copy without stopping the live database or properly following
the documented PITR backup and recovery techniques.
No, I don't do any advanced backup tricks. Just plain pg_dump and psql.
This time the error is not in pre-live database and therefore it
doesn't go away.
If I understand you, this sounds like corruption in the live
database; nothing on the pre-live database is part of causing this
problem.
That would be the case if I were doing a filesystem-level copy, but I do not.
The server is also running [...] Samba [...]
I hope you're not trusting Samba too far. For a while we were using
it in backups across our WAN, and it mangled at least one file
almost every day. We had to take to running md5sum against both
ends for each file to ensure we didn't get garbage (until we
converted everything to use TCP communications, which have never
mangled anything for us).
As I said, I'm using rdiff-backup to transfer pure SQL files.
Both fsync and full_page_writes are on.
Good. Without those an OS or hardware crash can corrupt your
database.
Actually they are commented out, but I suppose this means "on".
OK, I don't have a UPS for this machine, but power has been stable.
Current uptime is 32 days, which I bet is from the last kernel
update.
OK. A power outage wouldn't be too likely to matter if you have
fsync and full_page_writes on.
That's a relief :).
Currently I blame either faulty memory or a faulty software RAID
driver. I can easily rule out the memory by running
memtest86 for a few hours.
Is this ECC memory? If not, even a good test doesn't prove that a
RAM problem didn't cause the corruption.
It's not ECC memory.
Now, off to buy a UPS...
Not a bad idea, but it doesn't sound like lack of that is likely to
have caused the corruption in your live database, based on the
settings you mentioned. (Assuming those settings are in use on the
live server.)
I checked the live server; it also has fsync=on and full_page_writes=on.
But it shouldn't matter, because the backup of the live server doesn't
give any errors.
It is possible that the restore of the pre-live database using psql takes
so long that the backup of the same database using pg_dump is already
kicking in. But again, this shouldn't matter, and it doesn't explain why
the latest error is in another database that hasn't changed for months.
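One cheap way to rule out this suspected overlap entirely is to serialize the restore and the backup cron jobs on a shared lock. The sketch below uses the portable mkdir-as-lock idiom (util-linux flock(1) would do the same job on Linux); the lock path is an assumption.

```shell
#!/bin/sh
# Serializing the two nightly cron jobs on one lock rules out the
# restore/backup overlap. mkdir is atomic, so only one job can win.
# The lock path is illustrative; real jobs would share a fixed path.
LOCK="${TMPDIR:-/tmp}/pg-nightly.lock.$$"

run_locked() {               # usage: run_locked <command> [args...]
    if mkdir "$LOCK" 2>/dev/null; then
        "$@"                 # run the job while holding the lock
        rmdir "$LOCK"
    else
        echo "skipped: another nightly job holds the lock" >&2
        return 1
    fi
}

# The restore job would use:  run_locked psql ...
# The backup job would use:   run_locked pg_dumpall ...
# Demonstrated with a trivial command:
run_locked echo "ran with lock held"
```

With both jobs wrapped this way, a backup that fires while the restore is still running simply waits for the next night instead of dumping a half-restored database.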
Now I have to find time to run memtest.
Tambet
Tambet Matiisen <tambet.matiisen@gmail.com> wrote:
On 16.03.2011 17:09, Kevin Grittner wrote:
Tambet Matiisen<tambet.matiisen@gmail.com> wrote:
Pre-live database is restored from live database dump every
night.
How is that done? A single pg_dump of the entire live database
restored using psql? Are both database servers at the same
PostgreSQL version?
Yes, I use pg_dump on the live server and the result is
rdiff-backupped onto the development server. The whole SQL dump is 12 GB
without compression, and the rdiff delta is about 10-20 MB every
day. Then I drop the pre-live database on the development server and
recreate it using createdb and psql.
createdb, not initdb? I suggest you back up and delete everything in
the data directory, and start with initdb, and see whether the
problem still exists. If it goes away, the problem was in your
shared system tables. If it persists, the problem is in your backup
files, and I would try a delete and a fresh copy. If *that* fixes
it you know the problem was with rdiff-backup. (Of course, keeping
copies of things before the delete might provide useful forensic
information.)
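Kevin's experiment could look roughly like the following on a Debian-packaged 8.4 install; pg_createcluster is Debian's wrapper around initdb, and the cluster name, paths, and dump location are all assumptions. The function is only defined here, never run, since it replaces the whole cluster.

```shell
#!/bin/sh
# Sketch of the suggested test: set the suspect data directory aside,
# initdb a fresh cluster, and reload from the plain-SQL backups.
# pg_createcluster is Debian's initdb wrapper; paths are assumptions.
fresh_cluster_test() {
    pg_ctlcluster 8.4 main stop
    # Keep the suspect directory for forensics rather than deleting it.
    mv /var/lib/postgresql/8.4/main /var/lib/postgresql/8.4/main.suspect
    pg_createcluster 8.4 main --start        # runs initdb underneath
    # Restore the pg_dumpall output and watch whether the errors return.
    psql -U postgres -d postgres -f /srv/pg-dumps/cluster.sql
}
```

If the errors vanish on the fresh cluster, the corruption lived in the old data directory; if they return after restoring, the backup files themselves are suspect, as Kevin outlines.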
For a while development server was running 8.4 and live server
8.1. Now both are 8.4, but this shouldn't matter, as I do backup
and restore via SQL.
I hope you were using the 8.4 version of pg_dump when you were in
the dual-version situation. Using the earlier version of pg_dump is
not guaranteed to provide a backup which can be cleanly installed on
a later version. That could *possibly* be related to current
problems.
You're running pg_dump against a database you just restored from
a pg_dump image?
Hmm, yeah. This sounds rather dumb, but I haven't gotten around to changing that yet.
Well, it might not be as dumb as you think, if it uncovered a
problem with your dump/restore process from live to pre-live.
Development server contains some additional databases as well,
that do not exist on live server.
So are you really using pg_dumpall or pg_dump?
Both fsync and full_page_writes are on.
Good. Without those an OS or hardware crash can corrupt your
database.
Actually they are commented out, but I suppose this means "on".
Yeah, they default to the safe setting.
It's not ECC memory.
Well, then there has been proven to be a non-negligible possibility
of occasional random bit-flips. Seriously, next time you upgrade,
make sure any database server has ECC RAM.
It is possible that the restore of the pre-live database using psql
takes so long that the backup of the same database using pg_dump is
already kicking in.
Hmmm... You might want to do enough logging of the processes to be
able to confirm or eliminate that possibility. Dumping an
incompletely-restored database might generate some odd errors.
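The logging Kevin suggests can be as small as timestamped start/end markers written by both cron jobs into one shared file, so any overlap shows up in a single read of the log. The log path and job names below are assumptions.

```shell
#!/bin/sh
# Minimal start/end logging for the nightly jobs. LOG defaults to a temp
# file here; the real jobs would share a fixed path (an assumed path such
# as /var/log/pg-nightly.log).
LOG="${LOG:-$(mktemp)}"

log_step() {                 # usage: log_step <job> <start|end>
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1 $2" >> "$LOG"
}

# The restore script would bracket its work like this:
log_step restore start
# ... dropdb / createdb / psql ...
log_step restore end
# and the backup script likewise with: log_step backup start|end
```

A restore "end" timestamp later than a backup "start" timestamp for the same night would confirm the overlap; if the timestamps never interleave, that explanation is eliminated.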
-Kevin
On 16.03.2011 22:29, Kevin Grittner wrote:
Tambet Matiisen<tambet.matiisen@gmail.com> wrote:
Yes, I use pg_dump on the live server and the result is
rdiff-backupped onto the development server. The whole SQL dump is 12 GB
without compression, and the rdiff delta is about 10-20 MB every
day. Then I drop the pre-live database on the development server and
recreate it using createdb and psql.
createdb, not initdb? I suggest you back up and delete everything in
the data directory, and start with initdb, and see whether the
problem still exists. If it goes away, the problem was in your
shared system tables. If it persists, the problem is in your backup
files, and I would try a delete and a fresh copy. If *that* fixes
it you know the problem was with rdiff-backup. (Of course, keeping
copies of things before the delete might provide useful forensic
information.)
Yes, I use createdb to recreate just one database. I doubt the backup
files could cause such an error; they are plain SQL files.
Today I got another error, so things seem to be getting worse:
Warning: pg_dump: WARNING: could not write block 188224 of base/2802415579/2802416218
DETAIL: Multiple failures --- write error might be permanent.
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: xlog flush request 200EB/9E4CD48 is not satisfied --- flushed only to CC/3F22EFB4
LINE 1: ...LECT tableoid, oid, nspname, (SELECT rolname FROM pg_catalog...
CONTEXT: writing block 188224 of relation base/2802415579/2802416218
pg_dump: The command was: SELECT tableoid, oid, nspname, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = nspowner) AS rolname, nspacl FROM pg_namespace
pg_dumpall: pg_dump failed on database "hekotekerp", exiting
Warning: Failed to dump pgsql cluster
Strange that I have no problems actually using that database.
For a while development server was running 8.4 and live server
8.1. Now both are 8.4, but this shouldn't matter, as I do backup
and restore via SQL.
I hope you were using the 8.4 version of pg_dump when you were in
the dual-version situation. Using the earlier version of pg_dump is
not guaranteed to provide a backup which can be cleanly installed on
a later version. That could *possibly* be related to current
problems.
I used the 8.1 version of pg_dump previously, but had no problems with it.
Currently both are 8.4, so this is not a problem.
Development server contains some additional databases as well,
that do not exist on live server.
So are you really using pg_dumpall or pg_dump?
I'm using pg_dump on the live server and pg_dumpall on the development server.
It's not ECC memory.
Well, then there has been proven to be a non-negligible possibility
of occasional random bit-flips. Seriously, next time you upgrade,
make sure any database server has ECC RAM.
Thanks for the tip, I will do that.
It is possible, that restore of pre-live database using psql lasts
so long, that backup of the same database using pg_dump is already
kicking in.
Hmmm... You might want to do enough logging of the processes to be
able to confirm or eliminate that possibility. Dumping an
incompletely-restored database might generate some odd errors.
Thanks Kevin for suggestions, investigating further...
Tambet