Rows missing from table despite FK constraint

Started by Konrad Garusover 16 years ago35 messagesgeneral
Jump to latest
#1Konrad Garus
konrad.garus@gmail.com

Hello,

We use PG 8.3. We use pg_dump and pg_restore overnight to create
copies of main database for reporting etc. One dump/restore runs at 9
PM, another at 11 PM.

Today I discovered that the restore at 11 PM failed to recreate a
foreign key constraint, because one row from master table was missing.
It is also missing from main database, but not from the 9 PM dump.

The main database is in curious state: The row from master table is
missing, the row referencing it from slave table is present, and
finally the FK constraint on slave is in place.

Do you have any ideas on how it could possibly happen? What research
could help find the root cause and fix the database?

Thanks.

--
Konrad Garus

#2Konrad Garus
konrad.garus@gmail.com
In reply to: Konrad Garus (#1)
Re: Rows missing from table despite FK constraint

One more bit of information. The master table has exactly 3 rows
missing. They were all inserted half a year ago within one minute and
no other rows have been inserted in between them.

Is it possible that we lost a consistent piece of data (like a block
or a page)? What can I do to track it down?

I suspect it is a bug in PostgreSQL. Does it resemble an already
submitted issue?

--
Konrad Garus

#3Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Konrad Garus (#1)
Re: Rows missing from table despite FK constraint

Konrad Garus wrote:

We use PG 8.3. We use pg_dump and pg_restore overnight to create
copies of main database for reporting etc. One dump/restore runs at 9
PM, another at 11 PM.

Today I discovered that the restore at 11 PM failed to recreate a
foreign key constraint, because one row from master table was missing.
It is also missing from main database, but not from the 9 PM dump.

The main database is in curious state: The row from master table is
missing, the row referencing it from slave table is present, and
finally the FK constraint on slave is in place.

Do you have any ideas on how it could possibly happen? What research
could help find the root cause and fix the database?

If that's really the case, it sounds like curruption.

Is there anything in the server logs?

Yours,
Laurenz Albe

#4Konrad Garus
konrad.garus@gmail.com
In reply to: Laurenz Albe (#3)
Re: Rows missing from table despite FK constraint

2010/1/7 Albe Laurenz <laurenz.albe@wien.gv.at>:

If that's really the case, it sounds like curruption.

Is there anything in the server logs?

I'm pretty sure it is corruption. With FK in place I don't see any
other way how it would be possible.

The log is too large to read through, but I haven't spotted anything
suspicious (searching by table name, grepping for error and vacuum,
quick scan from when the state was consistent until when it was not).

Also, reindex on the master table did not help.

Is there anything I could do on logs or the database that could help
diagnose the issue?

--
Konrad Garus

#5Guillaume Lelarge
guillaume@lelarge.info
In reply to: Konrad Garus (#1)
Re: Rows missing from table despite FK constraint

Le 07/01/2010 11:12, Konrad Garus a �crit :

Hello,

We use PG 8.3. We use pg_dump and pg_restore overnight to create
copies of main database for reporting etc. One dump/restore runs at 9
PM, another at 11 PM.

Today I discovered that the restore at 11 PM failed to recreate a
foreign key constraint, because one row from master table was missing.
It is also missing from main database, but not from the 9 PM dump.

The main database is in curious state: The row from master table is
missing, the row referencing it from slave table is present, and
finally the FK constraint on slave is in place.

Do you have any ideas on how it could possibly happen?

Disabling trigger does this. Here is a quick (but long) example:

guillaume@laptop:~$ createdb b1

guillaume@laptop:~$ LANG=C psql b1
psql (8.5devel)
Type "help" for help.

b1=# create table t1 (c1 integer primary key, c2 integer);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "t1_pkey"
for table "t1"
CREATE TABLE

b1=# create table t2 (c1 integer, c1_t1 integer references t1(c1));
CREATE TABLE

b1=# \d t1
Table "public.t1"
Column | Type | Modifiers
--------+---------+-----------
c1 | integer | not null
c2 | integer |
Indexes:
"t1_pkey" PRIMARY KEY, btree (c1)
Referenced by:
TABLE "t2" CONSTRAINT "t2_c1_t1_fkey" FOREIGN KEY (c1_t1) REFERENCES
t1(c1)

b1=# \d t2
Table "public.t2"
Column | Type | Modifiers
--------+---------+-----------
c1 | integer |
c1_t1 | integer |
Foreign-key constraints:
"t2_c1_t1_fkey" FOREIGN KEY (c1_t1) REFERENCES t1(c1)

b1=# insert into t1 values (1, 100);
INSERT 0 1

b1=# insert into t1 values (2, 200);
INSERT 0 1

b1=# insert into t2 values (1, 1);
INSERT 0 1

b1=# insert into t2 values (1, 2);
INSERT 0 1

b1=# insert into t2 values (1, 3);
ERROR: insert or update on table "t2" violates foreign key constraint
"t2_c1_t1_fkey"
D�TAIL : Key (c1_t1)=(3) is not present in table "t1".

Which is right. Now, we disable triggers:

b1=# alter table t2 disable trigger all;
ALTER TABLE
b1=# insert into t2 values (1, 3);
INSERT 0 1

The INSERT now works. The FK is not checked.

b1=# alter table t2 enable trigger all;
ALTER TABLE

Reenabling triggers won't alert you. Rows are inserted and will stay
that way.

b1=# select * from t1;
c1 | c2
----+-----
1 | 100
2 | 200
(2 lines)

b1=# select * from t2;
c1 | c1_t1
----+-------
1 | 1
1 | 2
1 | 3
(3 lines)

Despite my examples are on 8.5dev, you have the same issue with 8.3. See
http://www.postgresql.org/docs/8.3/interactive/sql-altertable.html for
more details.

So, question is: did you disable triggers sometime on the referenced table?

--
Guillaume.
http://www.postgresqlfr.org
http://dalibo.com

#6Konrad Garus
konrad.garus@gmail.com
In reply to: Guillaume Lelarge (#5)
Re: Rows missing from table despite FK constraint

So, question is: did you disable triggers sometime on the referenced table?

No, at least not intentionally. More information:

1. Missing are 3 rows added quickly one after another over half a
year ago. They were lost this week. That is the only corruption I am
aware of.

2. The problem is rows that were present in master table and are
gone, not superfluous rows in slave table.

3. The master table has this rule preventing DELETE on it:

Rules:
master_table_no_delete AS
ON DELETE TO master_table DO INSTEAD SELECT no_delete() AS no_delete

Where no_delete is:

begin
raise exception 'Cannot remove rows from the table.';
end;

I am pretty confident it was not caused by us or our software. The
data is relatively old, the missing piece is small and the no_delete
rule is in place.

Can it be a side effect of some PG job, such as vacuum etc?

Let me know what other information could be helpful.

--
Konrad Garus

#7Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Konrad Garus (#1)
Re: Rows missing from table despite FK constraint

On 7 Jan 2010, at 11:12, Konrad Garus wrote:

Hello,

We use PG 8.3. We use pg_dump and pg_restore overnight to create
...

You seem to have lost the actual data, not the index entries pointing to it, or a sequential scan (eg. pg_dump) would still have found your rows.

Do you have any ideas on how it could possibly happen? What research
could help find the root cause and fix the database?

What kind of file-system is the affected table on? - and while we're at it, what OS/Distribution and version? Is your data on some kind of RAID array? If so, what type (hardware/software, RAID type)?

I get the impression the data you lost and the data around it hasn't been written to in a long time; it wouldn't surprise me if your problem would have been caused by a bad sector on a disk, but that depends on how reliable your storage is set up to be.

Bad memory is another typical cause of corruption, but not likely in your case.

And of course there could be a bug in PG; are you up to date on the minor versions?

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4b47130010732637119309!

#8Konrad Garus
konrad.garus@gmail.com
In reply to: Alban Hertroys (#7)
Re: Rows missing from table despite FK constraint

2010/1/8 Alban Hertroys <dalroi@solfertje.student.utwente.nl>:

You seem to have lost the actual data, not the index entries pointing to it, or a sequential scan (eg. pg_dump) would still have found your rows.

I agree.

What kind of file-system is the affected table on? - and while we're at it, what OS/Distribution and version? Is your data on some kind of RAID array? If so, what type (hardware/software, RAID type)?

It's ext3 on a hardware RAID1. The array is in perfect condition,
according to its diag tool. The OS is Ubuntu 8.04. The exact PG
version is 8.3.8.

I get the impression the data you lost and the data around it hasn't been written to in a long time; it wouldn't surprise me if your problem would have been caused by a bad sector on a disk, but that depends on how reliable your storage is set up to be.

You are correct about the first point. It's a write-only table with
thousands of inserts daily, and the lost rows were written 7 months
ago.

--
Konrad Garus

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#8)
Re: Rows missing from table despite FK constraint

Konrad Garus <konrad.garus@gmail.com> writes:

2010/1/8 Alban Hertroys <dalroi@solfertje.student.utwente.nl>:

I get the impression the data you lost and the data around it hasn't been written to in a long time; it wouldn't surprise me if your problem would have been caused by a bad sector on a disk, but that depends on how reliable your storage is set up to be.

You are correct about the first point. It's a write-only table with
thousands of inserts daily, and the lost rows were written 7 months
ago.

Do you know that the rows disappeared recently?

regards, tom lane

#10Konrad Garus
konrad.garus@gmail.com
In reply to: Tom Lane (#9)
Re: Rows missing from table despite FK constraint

2010/1/8 Tom Lane <tgl@sss.pgh.pa.us>:

Do you know that the rows disappeared recently?

Yes. They are present in dump from 9 PM and missing from dump from 1
AM. It must've happened within this 4-hour window.

--
Konrad Garus

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#10)
Re: Rows missing from table despite FK constraint

Konrad Garus <konrad.garus@gmail.com> writes:

2010/1/8 Tom Lane <tgl@sss.pgh.pa.us>:

Do you know that the rows disappeared recently?

Yes. They are present in dump from 9 PM and missing from dump from 1
AM. It must've happened within this 4-hour window.

Hm. It would be interesting to see if you can find the place where the
rows had been and dump it with pg_filedump
http://sources.redhat.com/rhdb/
You can look at the ctid column of the rows that are adjacent to the
missing ones according to your older dump, and then dump out those
blocks (I recommend -i -f style).

regards, tom lane

#12Konrad Garus
konrad.garus@gmail.com
In reply to: Tom Lane (#11)
Re: Rows missing from table despite FK constraint

How shall I do it? Is this correct:

1. Run:

select ctid, * from attachment where ...

on the table with such a WHERE clause that includes rows around the
missing ones. ctid around missing rows seems to be (603712,78) and
(603714,1). Note that 603713 is missing.

2. Run:

select relfilenode from pg_class where relname = 'attachment';"

(returns 922494)

3. Run:

pg_filedump -i -f -R 603712 603714
/var/lib/postgresql/8.3/main/base/922438/922494 > myfile

I'm unsure about it, because the resulting file does not seem to have
the rows I saw listed for block 603712 or 603714. I checked by text in
VARCHAR columns.

--
Konrad Garus

#13Konrad Garus
konrad.garus@gmail.com
In reply to: Konrad Garus (#12)
Re: Rows missing from table despite FK constraint

2010/1/8 Konrad Garus <konrad.garus@gmail.com>:

I'm unsure about it, because the resulting file does not seem to have
the rows I saw listed for block 603712 or 603714. I checked by text in
VARCHAR columns.

I must've done something wrong.

I found that row listed at:

Block 603712 ********************************************************
(header etc.)
Item 15 -- Length: 184 Offset: 5496 (0x1578) Flags: NORMAL
XMIN: 8124 XMAX: 0 CID|XVAC: 0
Block Id: 79424 linp Index: 15 Attributes: 19 Size: 32
infomask: 0x0903 (HASNULL|HASVARWIDTH|XMIN_COMMITTED|XMAX_INVALID)
t_bits: [0]: 0xff [1]: 0x79 [2]: 0x07

Has ctid equal (79424,15).

How do I run pg_filedump for rows with ctid between (603712,78) and (603714,1)?

--
Konrad Garus

#14Konrad Garus
konrad.garus@gmail.com
In reply to: Konrad Garus (#13)
Re: Rows missing from table despite FK constraint

OK, I got it. Attached is the dump of the missing block.

--
Konrad Garus

Attachments:

missing_blockapplication/octet-stream; name=missing_blockDownload
#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#14)
Re: Rows missing from table despite FK constraint

Konrad Garus <konrad.garus@gmail.com> writes:

OK, I got it. Attached is the dump of the missing block.

So which of these rows are invisible? According to the flags
items 1, 3 and 5 should be visible while 2 and 4 are dead versions
(of 3 and 5 respectively).

regards, tom lane

#16Konrad Garus
konrad.garus@gmail.com
In reply to: Tom Lane (#15)
Re: Rows missing from table despite FK constraint

2010/1/8 Tom Lane <tgl@sss.pgh.pa.us>:

So which of these rows are invisible?  According to the flags
items 1, 3 and 5 should be visible while 2 and 4 are dead versions
(of 3 and 5 respectively).

All 3 are invisible, and at the same time they are the only 3 rows
missing from the table.

--
Konrad Garus

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#16)
Re: Rows missing from table despite FK constraint

Konrad Garus <konrad.garus@gmail.com> writes:

2010/1/8 Tom Lane <tgl@sss.pgh.pa.us>:

So which of these rows are invisible? �According to the flags
items 1, 3 and 5 should be visible while 2 and 4 are dead versions
(of 3 and 5 respectively).

All 3 are invisible, and at the same time they are the only 3 rows
missing from the table.

Huh. Nothing obviously wrong with the data ... maybe an xid wraparound
issue? What's your current XID counter? (pg_controldata is the easiest
way to answer that)

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: Rows missing from table despite FK constraint

I wrote:

Huh. Nothing obviously wrong with the data ... maybe an xid wraparound
issue? What's your current XID counter? (pg_controldata is the easiest
way to answer that)

Also, what are the XMINs of the non-missing tuples in the adjacent
blocks?

regards, tom lane

#19Konrad Garus
konrad.garus@gmail.com
In reply to: Tom Lane (#18)
Re: Rows missing from table despite FK constraint

2010/1/8 Tom Lane <tgl@sss.pgh.pa.us>:

Also, what are the XMINs of the non-missing tuples in the adjacent
blocks?

# /usr/lib/postgresql/8.3/bin/pg_controldata /var/lib/postgresql/8.3/main/
pg_control version number: 833
Catalog version number: 200711281
Database system identifier: 5246886698902745063
Database cluster state: in production
pg_control last modified: Fri 08 Jan 2010 10:20:56 AM CST
Latest checkpoint location: 1D6/186B6BA0
Prior checkpoint location: 1D6/165DAF60
Latest checkpoint's REDO location: 1D6/174C8FB8
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/83037806
Latest checkpoint's NextOID: 142180690
Latest checkpoint's NextMultiXactId: 2250472
Latest checkpoint's NextMultiOffset: 5954794
Time of latest checkpoint: Fri 08 Jan 2010 10:18:33 AM CST
Minimum recovery ending location: 0/0
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: en_US.UTF-8
LC_CTYPE: en_US.UTF-8

2 mins around the missing rows:

ctid | xmin | attachment_id
-------------+----------+---------------
(603712,67) | 17140362 | 15460680
(603712,69) | 17140363 | 15460871
(603712,71) | 17140364 | 15460681
(603712,73) | 17140368 | 15460872
(603712,75) | 17140369 | 15460682
(603712,78) | 17140373 | 15460873
(603714,1) | 17140379 | 15460685
(603714,3) | 17140380 | 15460473
(603714,5) | 17140381 | 15460875
(603714,7) | 17140382 | 15460686
(603714,9) | 17140383 | 15460474

--
Konrad Garus

#20Konrad Garus
konrad.garus@gmail.com
In reply to: Konrad Garus (#19)
Re: Rows missing from table despite FK constraint

Just a reminder - these rows are over 6 months old and were lost at
night when the system was lightly used.

--
Konrad Garus

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#19)
#22Konrad Garus
konrad.garus@gmail.com
In reply to: Tom Lane (#21)
#23Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Konrad Garus (#22)
#24Konrad Garus
konrad.garus@gmail.com
In reply to: Alvaro Herrera (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#24)
#26Konrad Garus
konrad.garus@gmail.com
In reply to: Tom Lane (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#26)
#28Konrad Garus
konrad.garus@gmail.com
In reply to: Tom Lane (#27)
#29Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Konrad Garus (#28)
#30Konrad Garus
konrad.garus@gmail.com
In reply to: Adrian Klaver (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Konrad Garus (#28)
#32Alban Hertroys
dalroi@solfertje.student.utwente.nl
In reply to: Tom Lane (#27)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alban Hertroys (#32)
#34Konrad Garus
konrad.garus@gmail.com
In reply to: Alban Hertroys (#32)
#35Konrad Garus
konrad.garus@gmail.com
In reply to: Konrad Garus (#34)