Partially corrupted table

Started by Filip Hrbek, over 19 years ago · 10 messages · bugs
#1 Filip Hrbek
filip.hrbek@plz.comstar.cz

Platform:
CentOS release 4.3 (Final) (Linux 2.6.9-34.EL)

Database version:
PostgreSQL 8.1.3 on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)

Description:
One of about 100 tables is partially corrupted. An attempt to read or dump the data from the table sometimes succeeds and sometimes crashes.
After upgrading to 8.1.4 the behaviour remained unchanged (using the cluster created with 8.1.3). Unfortunately, I am not able to reproduce the error from scratch using ONLY 8.1.4.

Note:
After a successful dump/reload everything is OK. The problem is not in the data content itself but in the binary database cluster. This is why I would like to send you the whole cluster rather than a database dump as an attachment. The problem is the file size (30M). Please tell me where or how to send it. For simplicity, I removed all other objects from the database; there is only one table with several indexes, containing 56621 rows.

Here are some examples of the behaviour:

[root@devlin2 tmp]# pg_dumpall -p5447 -U postgres > pgdump.sql
pg_dump: ERROR: invalid memory alloc request size 4294967290
pg_dump: SQL command to dump the contents of table "fct" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: invalid memory alloc request size 4294967290
pg_dump: The command was: COPY dwhdata_salemc.fct (time_id, company_id, customer_id, product_id, flagsprod_id, flagssale_id, account_id, accttime_id, invcustomer_id, salesperson_id, vendor_id, inv_cost_amt, inv_base_amt, inv_amt, inv_qty, inv_wght, ret_cost_amt, ret_base_amt, ret_amt, ret_qty, ret_wght, unret_cost_amt, unret_base_amt, unret_amt, unret_qty, unret_wght, bonus_forecast, bonus_final, stamp_code) TO stdout;
pg_dumpall: pg_dump failed on database "dwhdb", exiting

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
SELECT

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR: row is too big: size 119264, maximum size 8136

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR: row is too big: size 38788, maximum size 8136

AFTER UPGRADE TO 8.1.4:
dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR: row is too big: size 52892, maximum size 8136

I noticed one more problem when executing vacuum:
dwhdb=# vacuum full;
WARNING: relation "pg_attribute" page 113 is uninitialized --- fixing
VACUUM

The "vacuum" problem has happened only once.

Regards
Filip Hrbek

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Filip Hrbek (#1)
Re: Partially corrupted table

"Filip Hrbek" <filip.hrbek@plz.comstar.cz> writes:

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
SELECT

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR: row is too big: size 119264, maximum size 8136

dwhdb=# create temp table t_fct as select * from dwhdata_salemc.fct;
ERROR: row is too big: size 38788, maximum size 8136

I think you've got hardware problems. Getting different answers from
successive scans of the same table is really hard to explain any other
way. memtest86 and badblocks are commonly suggested for testing memory
and disk respectively on Linux machines.

regards, tom lane

#3 Filip Hrbek
filip.hrbek@plz.comstar.cz
In reply to: Filip Hrbek (#1)
Re: Partially corrupted table

No, I'm sure it is not a HW problem. I tested the same DB cluster on two
different machines, and the error is exactly the same.
I can send you the cluster if you tell me how and where to send a 30M file.

Thanks for reply
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>
Cc: <pgsql-bugs@postgresql.org>
Sent: Tuesday, August 29, 2006 4:01 PM
Subject: Re: [BUGS] Partially corrupted table


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Filip Hrbek (#3)
Re: Partially corrupted table

"Filip Hrbek" <filip.hrbek@plz.comstar.cz> writes:

No, I'm sure it is not a HW problem. I tested the same DB cluster on two
different machines. The error is exactly the same.

Really?

I can send you the cluster if you tell me how and where to send a 30M file.

You can email it to me personally if you want. What's the exact
hardware/OS platform this is on, again?

regards, tom lane

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Filip Hrbek (#1)
Re: Partially corrupted table

Well, it's a corrupt-data problem all right. The tuple that's
causing the problem is on page 1208, item 27:

Item 27 -- Length: 240 Offset: 1400 (0x0578) Flags: USED
XMIN: 5213 CMIN: 140502 XMAX: 0 CMAX|XVAC: 0
Block Id: 1208 linp Index: 27 Attributes: 29 Size: 28
infomask: 0x0902 (HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)

0578: 5d140000 d6240200 00000000 00000000 ]....$..........
0588: 0000b804 1b001d00 02091c00 0e000000 ................
0598: 02000000 42020000 23040000 6b000000 ....B...#...k...
05a8: 02000000 6a010000 0d000000 42020000 ....j.......B...
05b8: 02000000 10000000 08000000 00000400 ................
05c8: 08000000 00000400 0a000000 ffff0400 ................
05d8: 78050000 0a000000 00000200 03000000 x...............
05e8: 08000000 00000300 08000000 00000400 ................
05f8: 08000000 00000400 08000000 00000400 ................
0608: 08000000 00000200 08000000 00000300 ................
0618: 08800000 00000400 08000000 00000400 ................
^^^^^^^^
0628: 08000000 00000400 08000000 00000200 ................
0638: 08000000 00000300 08000000 00000400 ................
0648: 08000000 00000400 18000000 494e565f ............INV_
0658: 41534153 5f323030 36303130 31202020 ASAS_20060101

The underlined word is a field length word that evidently should contain
8, but contains hex 8008. This causes the tuple-data decoder to step
way past the end of the tuple and off into never-never land. Since the
results will depend on which shared buffer the page happens to be in and
what happens to be at the address the step lands at, the inconsistent
results from try to try are not so surprising.

The next question is how did it get that way. In my experience a
single-bit flip like that is most likely to be due to flaky memory,
though bad motherboards or cables are not out of the question either.
I'd recommend some thorough hardware testing on the original machine.

It seems there's only the one bad bit; I did

dwhdb=# delete from dwhdata_salemc.fct where ctid = '(1208,27)';
DELETE 1

and then was able to copy the table repeatedly without crash. I'd
suggest doing that and then reconstructing the deleted tuple from
the above dump.

regards, tom lane

#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: Partially corrupted table

Tom Lane wrote:

The underlined word is a field length word that evidently should contain
8, but contains hex 8008. This causes the tuple-data decoder to step
way past the end of the tuple and off into never-never land. Since the
results will depend on which shared buffer the page happens to be in and
what happens to be at the address the step lands at, the inconsistent
results from try to try are not so surprising.

Hmm, perhaps we could protect against "impossible" length words? (i.e.,
those that would cause us to try to read past the end of the current
buffer).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#6)
Re: Partially corrupted table

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

The underlined word is a field length word that evidently should contain
8, but contains hex 8008. This causes the tuple-data decoder to step
way past the end of the tuple and off into never-never land.

Hmm, perhaps we could protect against "impossible" length words?

Perhaps. I'm hesitant to add additional tests into the inner loop of
heap_deform_tuple and friends though. It's not like it's going to do
anything to recover your data post-corruption :-(

regards, tom lane

#8 Filip Hrbek
filip.hrbek@plz.comstar.cz
In reply to: Filip Hrbek (#1)
Re: Partially corrupted table

Tom, thank you very much for your excellent and fast analysis (I mean it
seriously, I am comparing your help to IBM Informix commercial support
:-) ).

It is possible that the corruption was caused by a HW problem at the
customer's server, and that the problem then appeared in our development
environment as well because the data was already corrupted. I will
recommend that the customer run some memory tests.

We have been using PostgreSQL on 14 customer servers for almost 5 years,
and this is the first time it has crashed - and perhaps due to a HW
problem. Great work!

Regards
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>
Cc: <pgsql-bugs@postgreSQL.org>
Sent: Wednesday, August 30, 2006 1:33 AM
Subject: Re: [BUGS] Partially corrupted table


#9 Filip Hrbek
filip.hrbek@plz.comstar.cz
In reply to: Filip Hrbek (#1)
Re: Partially corrupted table

Well, it would be a good idea to check the row validity. If a row is
partially corrupted, I may get an error (which is relatively good), but I
may just as easily get bad results with no error at all, which is really
bad.

After Tom worked out which row causes the problem, I tried to select the
data from the corrupted row:

dwhdb=# select * from dwhdata_salemc.fct where time_id = 14 and company_id =
2 and customer_id = 578 and product_id = 1059 and flagsprod_id = 107 and
flagssale_id = 2 and account_id = 362 and accttime_id = 13 and
invcustomer_id = 578 and salesperson_id = 2 and vendor_id = 16;
time_id | company_id | customer_id | product_id | flagsprod_id |
flagssale_id | account_id | accttime_id | invcustomer_id | salesperson_id |
vendor_id | inv_cost_amt | inv_base_amt | inv_amt | inv_qty | inv_wght |
ret_cost_amt | ret_base_amt | ret_amt | ret_qty | ret_wght | unret_cost_amt
| unret_base_amt | unret_amt |
unret_qty | unret_wght | bonus_forecast | bonus_final |
stamp_code
---------+------------+-------------+------------+--------------+--------------+------------+-------------+----------------+----------------+-----------+--------------+--------------+---------+---------+----------+--------------+--------------+---------+---------+----------+----------------+-----------------------------------+-----------------------------------+-----------------------------------+------------+----------------+-------------+----------------------
14 | 2 | 578 | 1059 | 107 |
2 | 362 | 13 | 578 | 2 | 16 |
0.0000 | 0.0000 | 0.1400 | 3.00 | 0.000 | 0.0000 |
0.0000 | 0.0000 | 0.00 | 0.000 | 8.0000 |
000040008000000000004000800000000
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| 000040008000000000004000800000000 | 000040008000000000004000800000000 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0.000 | 0.0000 | 0.0000 | INV_ASAS_20060601
^^^^^^^^^^^^^^^^^^^
(1 row)

As you can see, the underlined data is bad, but I get no error report.
There should be all zeros in the number fields, and 'INV_ASAS_20060101'
instead of 'INV_ASAS_20060601' in the stamp field. All of this is caused
by a single inverted bit. I consider this very dangerous.

Regards
Filip Hrbek

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Alvaro Herrera" <alvherre@commandprompt.com>
Cc: "Filip Hrbek" <filip.hrbek@plz.comstar.cz>; <pgsql-bugs@postgreSQL.org>
Sent: Wednesday, August 30, 2006 4:10 AM
Subject: Re: [BUGS] Partially corrupted table


#10 Bruno Wolff III
bruno@wolff.to
In reply to: Filip Hrbek (#8)
Re: Partially corrupted table

On Wed, Aug 30, 2006 at 10:18:46 +0200,
Filip Hrbek <filip.hrbek@plz.comstar.cz> wrote:

It is possible that the corruption was caused by a HW problem at the
customer's server, and that the problem then appeared in our development
environment as well because the data was already corrupted. I will
recommend that the customer run some memory tests.

You might also ask them whether they are using ECC memory. A single-bit flip
could easily have happened even with good memory that has no error correction.