Duplicate values found when reindexing unique index
Hello --
I noticed recently some errors in my postgres log like the following:
ERROR: could not open segment 9 of relation 1663/16830/1384385 (target
block 776929292): No such file or directory
These are occurring at a rate of 1 every 2-3 days. But that rate has been
increasing.
After Googling around, I found this error could indicate index corruption.
I found that the relation with oid 1384385 was the following
index: "index_entry_on_guid_and_feed_id" UNIQUE, btree (guid, feed_id) -- on
a table named entry.
When I tried to rebuild that index, I got the following error:
prod_2=# reindex index index_entry_on_guid_and_feed_id;
ERROR: could not create unique index
DETAIL: Table contains duplicated values.
Now, if that index was in fact corrupted, then perhaps it is not such a
surprise that the duplicate values are there. The thing that makes this more
mysterious is that we have another unique index on entry(feed_id, guid) --
and I have no reason to believe it is corrupted (although I guess it still
could be).
Please note: I have previously reported another issue with duplicate values
making it into a table despite having a unique index in place (bug #3724).
This issue involves a completely different table.
We can stay running despite this error but I would very much like to track
down the cause as soon as possible. To avoid destroying any evidence, I'm
going to leave things intact and await further instructions.
Given my experience, the reliability of unique indexes is becoming somewhat
suspect. Please help. ;-)
thanks in advance,
Mason
"Mason Hale" <masonhale@gmail.com> writes:
Given my experience, the reliability of unique indexes is becoming somewhat
suspect. Please help. ;-)
Well, as in the previous report, there is not enough information here to
offer much chance of understanding what's going wrong.
Have you tried reindexing that other index with the same columns in the
other order? My guess is that there really are duplicate entries in the
table; if so the other one should fail too. If so, please try to
identify the duplicated values, along the lines of
select guid, feed_id from entry group by 1,2 having count(*) > 1
and show us the system columns (ctid,xmin,xmax,cmin,cmax) from the
tuples having duplicate value(s). Note that you should probably disable
indexscan and bitmapscan while doing this probing, so as not to have the
queries use the suspect indexes.
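For anyone following along, Tom's probing session could look something like the following sketch (the column names are taken from the table definition Mason gives later in the thread):

```sql
-- Force sequential scans so the suspect indexes are not consulted
SET enable_indexscan = off;
SET enable_bitmapscan = off;

-- Show the system columns for every row involved in a duplicate pair
SELECT id, ctid, xmin, xmax, cmin, cmax
FROM entry
WHERE (guid, feed_id) IN (
    SELECT guid, feed_id
    FROM entry
    GROUP BY 1, 2
    HAVING count(*) > 1
);
```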
This is 8.2.5 right? Was the DB loaded fresh into 8.2.5, or was it
inherited from previous 8.2.x release(s)?
BTW, what are the datatypes of the index columns?
regards, tom lane
On Dec 30, 2007 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Mason Hale" <masonhale@gmail.com> writes:
Given my experience, the reliability of unique indexes is becoming
somewhat
suspect. Please help. ;-)
Well, as in the previous report, there is not enough information here to
offer much chance of understanding what's going wrong.
Please just tell me what information you need and I will provide what I can.
Have you tried reindexing that other index with the same columns in the
other order? My guess is that there really are duplicate entries in the
table; if so the other one should fail too. If so, please try to
identify the duplicated values, along the lines of
select guid, feed_id from entry group by 1,2 having count(*) > 1
and show us the system columns (ctid,xmin,xmax,cmin,cmax) from the
tuples having duplicate value(s). Note that you should probably disable
indexscan and bitmapscan while doing this probing, so as not to have the
queries use the suspect indexes.
I found a single pair of rows that were duplicated. Interestingly it was not
just the guid and feed_id that were duplicated but all columns were
identical, including the primary key, except an updated_at column which is
automatically populated via a trigger (BEFORE UPDATE on entry FOR EACH ROW).
The duplicate data included a created_at column which defaults to now() --
that the two duplicate rows have exactly the same values strongly hints to
me that the duplicates were created during the same transaction.
Here's the system column data you requested.
id | ctid | xmin | xmax | cmin | cmax
-----------+--------------+------+------+------+------
151341072 | (1508573,11) | 2 | 0 | 19 | 0
151341072 | (1818219,11) | 2 | 0 | 19 | 0
(2 rows)
This is 8.2.5 right? Was the DB loaded fresh into 8.2.5, or was it
inherited from previous 8.2.x release(s)?
It is 8.2.5. It was loaded from a pg_dump from an 8.2.3 database into a
fresh 8.2.5 database on new hardware.
BTW, what are the datatypes of the index columns?
id integer not null (serial)
guid character varying not null ( no size limit defined )
feed_id integer not null
regards, tom lane
thanks for the help Tom.
I do want to clear out one of the offending duplicates and reindex. But I'll
wait to do that until I get the okay from you.
Mason
Trolling through my server log I found this error:
2007-12-30 20:02:08 CST (10.11.199.136) PANIC: right sibling's left-link
doesn't match
2007-12-30 20:02:08 CST (10.11.199.136) STATEMENT: update bdu.entry set
title=$1, author=$2, description_type=$3, description_length=$4,
description=$5, published_at=$6, republished_at=$7, link=$8,
link_page_id=$9, link_count=$10, enclosure=$11, enclosure_page_id=$12,
enclosure_count=$13 where id=$14
2007-12-30 20:02:08 CST () LOG: server process (PID 30004) was terminated
by signal 6
2007-12-30 20:02:08 CST () LOG: terminating any other active server
processes
This seems related to the entry table -- so I wonder if it is related to
this problem?
"Mason Hale" <masonhale@gmail.com> writes:
I found a single pair of rows that were duplicated. Interestingly it was not
just the guid and feed_id that were duplicated but all columns were
identical, including the primary key, except an updated_at column which is
automatically populated via a trigger (BEFORE UPDATE on entry FOR EACH ROW).
The duplicate data included a created_at column which defaults to now() --
that the two duplicate rows have exactly the same values strongly hints to
me that the duplicates were created during the same transaction.
Hmm, what sounds more likely to me is that both of those rows are
updated versions of the same original row.
Here's the system column data you requested.
id | ctid | xmin | xmax | cmin | cmax
-----------+--------------+------+------+------+------
151341072 | (1508573,11) | 2 | 0 | 19 | 0
151341072 | (1818219,11) | 2 | 0 | 19 | 0
(2 rows)
I wonder whether it's just a coincidence that these have the same offset
number...
The fact that both rows have been frozen (xmin=2) means we can't really
tell whether they originated in the same transaction or not.
Can you show us all the triggers on this table? Also, it would be real
interesting to see "pg_filedump -i -f" output for the two blocks in
question (1508573 and 1818219) --- see http://sources.redhat.com/rhdb/
to get a copy of pg_filedump.
regards, tom lane
Can you show us all the triggers on this table?
Here they are:
Triggers:
entry_correct_published_at_trigger BEFORE INSERT OR UPDATE ON entry FOR
EACH ROW EXECUTE PROCEDURE correct_published_at()
entry_feed_page_trigger BEFORE INSERT OR UPDATE ON entry FOR EACH ROW
EXECUTE PROCEDURE entry_feed_page_trigger()
entry_updated_at_trigger BEFORE UPDATE ON entry FOR EACH ROW EXECUTE
PROCEDURE update_updated_at_timestamp()
feed_entry_count_trigger AFTER INSERT ON entry FOR EACH ROW EXECUTE
PROCEDURE update_feed_entry_count()
Also, it would be real
interesting to see "pg_filedump -i -f" output for the two blocks in
question (1508573 and 1818219) --- see http://sources.redhat.com/rhdb/
to get a copy of pg_filedump.
I have downloaded, compiled and installed pg_filedump -- but I am not sure
how to determine which file I should have it dump. I am not very familiar
with the postgres file structure. Can you please provide some guidance? How
do I determine the correct file?
I've determined the relation "entry" has an oid = 16838 -- but the
/data/base/16830 directory contains 92 1GB files named 16838.[1-92]
I've tried:
pg_filedump -i -f -R 1508573 16838
and got the following error:
[postgres@prod-db-2 16830]$ pg_filedump -i -f -R 1508573 16838
*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 8.2.0
*
* File: 16838
* Options used: -i -f -R 1508573
*
* Dump created on: Sun Dec 30 23:18:01 2007
*******************************************************************
Error: Seek error encountered before requested start block <1508573>.
I tried a few other files in the 1-92 range with the same results.
Mason
"Mason Hale" <masonhale@gmail.com> writes:
I have downloaded, compiled and installed pg_filedump -- but I am not sure
how to determine which file I should have it dump. I am not very familiar
with the postgres file structure. Can you please provide some guidance? How
do I determine the correct file?
I've determined the relation "entry" has an oid = 16838 -- but the
/data/base/16830 directory contains 92 1GB files named 16838.[1-92]
In the first place, it's relfilenode not OID to look at --- these
are often the same but not always.
Once you've got past that, there are 131072 blocks per file segment,
so logical block 1508573 would translate to segment .11 block 66781.
pg_filedump is too stupid to do that arithmetic for you :-(
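The arithmetic is easy to script. Here is a small sketch (not part of the original exchange) mapping a logical heap block to the segment file suffix and the block number to hand to pg_filedump, assuming the default 8 KB block size and 1 GB (131072-block) segments:

```python
BLCKSZ = 8192          # default PostgreSQL block size
RELSEG_SIZE = 131072   # blocks per 1 GB segment file

def locate_block(logical_block):
    """Map a logical heap block to (segment number, block within segment)."""
    return divmod(logical_block, RELSEG_SIZE)

seg, blk = locate_block(1508573)
print(seg, blk)   # segment 11, block 66781
```

Given Mason's directory listing, the call for the first block would presumably be something like `pg_filedump -i -f -R 66781 16838.11` (file name assumed, not confirmed in the thread).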
regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
Here's the system column data you requested.
id | ctid | xmin | xmax | cmin | cmax
-----------+--------------+------+------+------+------
151341072 | (1508573,11) | 2 | 0 | 19 | 0
151341072 | (1818219,11) | 2 | 0 | 19 | 0
(2 rows)
I wonder whether it's just a coincidence that these have the same offset
number...
I can't imagine any Postgres bug which would depend on the offsets being the
same. But what I could imagine is filesystem corruption which copied the block
to someplace else in the table or possibly has even mapped the same block into
two different places in the table.
Can you unmount the filesystem and run "fsck -v -n" on it? Is this the only
duplicate record in the table? Are there any other records on either of these
blocks?
To answer the latter question I found a handy trick of converting the tid to a
"point" so I could refer to the block or offset. In 8.3 this looks like:
select ctid from foo where (ctid::text::point)[0] = 0;
But in 8.2 iirc you had to call the tid output function explicitly because
there was no cast to text.
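On 8.3 or later, the trick applied to the two blocks from Mason's output would be a sketch along these lines (on 8.2 the ctid-to-text step would need the explicit output function, as noted above):

```sql
-- (ctid::text::point)[0] extracts the block number from the tid
SELECT ctid, id
FROM entry
WHERE (ctid::text::point)[0] IN (1508573, 1818219);
```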
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark <stark@enterprisedb.com> writes:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
I wonder whether it's just a coincidence that these have the same offset
number...
I can't imagine any Postgres bug which would depend on the offsets
being the same. But what I could imagine is filesystem corruption
which copied the block to someplace else in the table or possibly has
even mapped the same block into two different places in the table.
That idea was in my mind too, but Mason stated that the rows showed
different "updated_at" values, so they couldn't be mirror images of that
sort. The pg_filedump output for the two blocks would be more conclusive
about this though --- I was expecting to pay attention to the whole
block contents not only the seemingly-dup rows.
regards, tom lane
On Mon, 2007-12-31 at 01:27 -0500, Tom Lane wrote:
Gregory Stark <stark@enterprisedb.com> writes:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
I wonder whether it's just a coincidence that these have the same offset
number...
I can't imagine any Postgres bug which would depend on the offsets
being the same. But what I could imagine is filesystem corruption
which copied the block to someplace else in the table or possibly has
even mapped the same block into two different places in the table.
That idea was in my mind too, but Mason stated that the rows showed
different "updated_at" values, so they couldn't be mirror images of that
sort.
Tom,
I think you misread Mason's post of 20:23 GMT-6 where he says the
created_at values are the *same*, not different. Mason's previous bug
report 3724 also had duplicate rows with matching created_at values.
So mangling block numbers and re-writing blocks in the wrong place is a
possibility. Filesystem corruption is just one way that can occur.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
I think you misread Mason's post of 20:23 GMT-6 where he says the
created_at values are the *same*, not different. Mason's previous bug
report 3724 also had duplicate rows with matching created_at values.
Yes, to confirm, the created_at values are the same. The *only* values that
are different are updated_at and that value is set via trigger.
Another data point that may or may not be relevant: I've been working to set
up a warm standby server and have run into some weird behavior there as
well. The symptom is the server encounters an error during the file restore,
then prematurely exits recovery mode. I'm using the pg_standby program for
the restore_command in my recovery.conf on the standby server.
The error message from the standby server was:
Last week I had turned my attention away from the database restore, and
didn't notice that our standby server had exited recovery mode on 12/20.
Here's the last few lines from the log file (full log attached).
2007-12-20 04:11:43 CST () LOG: restored log file
"000000010000042200000057" from archive
2007-12-20 04:13:09 CST () LOG: restored log file
"000000010000042200000058" from archive
2007-12-20 04:14:40 CST () LOG: restored log file
"000000010000042200000059" from archive
2007-12-20 04:14:40 CST () LOG: invalid info bits 0001 in log file 1058,
segment 89, offset 0
2007-12-20 04:14:40 CST () LOG: redo done at 422/58FFEE38
2007-12-20 04:14:40 CST () LOG: restored log
file "000000010000042200000058" from archive
2007-12-20 04:14:40 CST () LOG: archive recovery complete
2007-12-20 04:24:57 CST () LOG: database system is ready
thanks,
Mason
"Mason Hale" <masonhale@gmail.com> writes:
I think you misread Mason's post of 20:23 GMT-6 where he says the
created_at values are the *same*, not different. Mason's previous bug
report 3724 also had duplicate rows with matching created_at values.
Yes, to confirm, the created_at values are the same. The *only* values that
are different are updated_at and that value is set via trigger.
Which still demonstrates that they aren't exact images. I'd still like
to see the pg_filedump output though ...
Last week I had turned my attention away from the database restore, and
didn't notice that our standby server had exited recovery mode on 12/20.
Here's the last few lines from the log file (full log attached).
2007-12-20 04:11:43 CST () LOG: restored log file
"000000010000042200000057" from archive
2007-12-20 04:13:09 CST () LOG: restored log file
"000000010000042200000058" from archive
2007-12-20 04:14:40 CST () LOG: restored log file
"000000010000042200000059" from archive
2007-12-20 04:14:40 CST () LOG: invalid info bits 0001 in log file 1058,
segment 89, offset 0
Do you by any chance still have 000000010000042200000058 and
000000010000042200000059 archived? If so it would be useful to
look at the first dozen lines of "od -x" dump of each of them.
regards, tom lane
On Dec 31, 2007 9:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Mason Hale" <masonhale@gmail.com> writes:
I think you misread Mason's post of 20:23 GMT-6 where he says the
created_at values are the *same*, not different. Mason's previous bug
report 3724 also had duplicate rows with matching created_at values.
Yes, to confirm, the created_at values are the same. The *only* values that
are different are updated_at and that value is set via trigger.
Which still demonstrates that they aren't exact images. I'd still like
to see the pg_filedump output though ...
I sent in previous email. You saw it, yes?
2007-12-20 04:11:43 CST () LOG: restored log file
"000000010000042200000057" from archive
2007-12-20 04:13:09 CST () LOG: restored log file
"000000010000042200000058" from archive
2007-12-20 04:14:40 CST () LOG: restored log file
"000000010000042200000059" from archive
2007-12-20 04:14:40 CST () LOG: invalid info bits 0001 in log file 1058,
segment 89, offset 0
Do you by any chance still have 000000010000042200000058 and
000000010000042200000059 archived? If so it would be useful to
look at the first dozen lines of "od -x" dump of each of them.
Yes, I do. Here's the output:
[postgres@dev-db-2 wal_archive]$ od -x 000000010000042200000058 | head -n15
0000000 d05e 0002 0001 0000 0423 0000 0000 c100
0000020 f7df 472e e701 4728 0000 0100 2000 0000
0000040 a1db 81e6 0423 0000 0068 c000 0000 0000
0000060 0048 0000 002c 0000 0000 0000 0423 0000
0000100 0020 c100 0000 0000 0000 0000 0001 0000
0000120 0000 0000 780b 2ede 9f68 00f5 7834 0000
0000140 2f4d 0001 1f35 4774 0000 0000 0000 0000
0000160 0000 0000 0000 0000 0000 0000 0000 0000
*
0020000 d05e 0000 0001 0000 0422 0000 2000 5800
0020020 53c2 48bc 0422 0000 1fbc 5800 ce6f 2edd
0020040 003a 0000 001e 0000 0b00 0000 067f 0000
0020060 41be 0000 1ff8 0015 0000 0186 0007 0000
0020100 4337 000a 000c 008f 0122 0000 db84 d429
0020120 0422 0000 2010 5800 ce6f 2edd 003a 0000
[postgres@dev-db-2 wal_archive]$ od -x 000000010000042200000059 | head -n15
0000000 d05e 0001 0001 0000 006b 0000 6000 69dc
0000020 12ae 0000 6380 0024 0010 375a 21cd 1174
0000040 4001 0001 637c 0058 0010 375a 21cd 1174
0000060 4001 0001 6355 0010 0010 375a 21cd 1174
0000100 4001 0001 631d 005a 0010 375a 21cd 1174
0000120 4001 0001 62e8 001e 0010 375a 21cd 1174
0000140 4001 0001 629a 0019 0010 375a 21cd 1174
0000160 4001 0001 604a 0013 0010 375a 21cd 1174
0000200 4001 0001 6045 0025 0010 375a 21cd 1174
0000220 4001 0001 6044 0029 0010 375a 21cd 1174
0000240 4001 0001 603a 0005 0010 375a 21cd 1174
0000260 4001 0001 6020 0035 0010 375a 21cd 1174
0000300 4001 0001 600e 0050 0010 375a 21cd 1174
0000320 4001 0001 600d 0018 0010 375a 21cd 1174
0000340 4001 0001 600b 0033 0010 375a 21cd 1174
[postgres@dev-db-2 wal_archive]$
Mason
"Mason Hale" <masonhale@gmail.com> writes:
Can you show us all the triggers on this table?
Here they are:
Triggers:
entry_correct_published_at_trigger BEFORE INSERT OR UPDATE ON entry FOR
EACH ROW EXECUTE PROCEDURE correct_published_at()
entry_feed_page_trigger BEFORE INSERT OR UPDATE ON entry FOR EACH ROW
EXECUTE PROCEDURE entry_feed_page_trigger()
entry_updated_at_trigger BEFORE UPDATE ON entry FOR EACH ROW EXECUTE
PROCEDURE update_updated_at_timestamp()
feed_entry_count_trigger AFTER INSERT ON entry FOR EACH ROW EXECUTE
PROCEDURE update_feed_entry_count()
Actually I wanted to see the function bodies ...
regards, tom lane
Tom, I'll send these to you privately.
Mason
"Mason Hale" <masonhale@gmail.com> writes:
On Dec 31, 2007 9:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Do you by any chance still have 000000010000042200000058 and
000000010000042200000059 archived? If so it would be useful to
look at the first dozen lines of "od -x" dump of each of them.
Yes, I do. Here's the output:
[postgres@dev-db-2 wal_archive]$ od -x 000000010000042200000058 | head -n15
0000000 d05e 0002 0001 0000 0423 0000 0000 c100
0000020 f7df 472e e701 4728 0000 0100 2000 0000
0000040 a1db 81e6 0423 0000 0068 c000 0000 0000
Hmm, something wrong here. The data looks sane, but the page header
address ought to be in 0000000100000423000000C1 if I'm not mistaken.
And what's even odder is that the second page of the file has got
0020000 d05e 0000 0001 0000 0422 0000 2000 5800
0020020 53c2 48bc 0422 0000 1fbc 5800 ce6f 2edd
0020040 003a 0000 001e 0000 0b00 0000 067f 0000
0020060 41be 0000 1ff8 0015 0000 0186 0007 0000
0020100 4337 000a 000c 008f 0122 0000 db84 d429
0020120 0422 0000 2010 5800 ce6f 2edd 003a 0000
which *is* what you'd expect to find at the second block of
000000010000042200000058.
And then we have
[postgres@dev-db-2 wal_archive]$ od -x 000000010000042200000059 | head -n15
0000000 d05e 0001 0001 0000 006b 0000 6000 69dc
0000020 12ae 0000 6380 0024 0010 375a 21cd 1174
0000040 4001 0001 637c 0058 0010 375a 21cd 1174
0000060 4001 0001 6355 0010 0010 375a 21cd 1174
0000100 4001 0001 631d 005a 0010 375a 21cd 1174
which is just completely off in left field --- that's not even close to
being the right sequence number, plus it's not a valid
first-page-of-file header (which is what the xlog complaint message
was about). But on its own terms it might be valid data for someplace
in the middle of 000000010000006B00000069.
It might be worth trawling through both files to check the page headers
(every 8K) and see which ones agree with expectation and which don't.
The state of the ...0058 file might be explained by the theory that
you'd archived it a bit too late (after the first page had been
overwritten with newer WAL data), but the ...0059 file seems just plain
broken. I am starting to wonder about hardware or OS misfeasance
causing writes to be lost or misdirected.
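Tom's "trawl every 8K page header" suggestion could be scripted along these lines. This is only a sketch: the field layout (magic, info, timeline, then the page address as xlogid/xrecoff) is an assumption based on 8.2's xlog_internal.h, and a little-endian machine is assumed, matching the od -x output above.

```python
import struct

BLCKSZ = 8192                  # XLOG page size (defaults to BLCKSZ in 8.2)
SEG_SIZE = 16 * 1024 * 1024    # 16 MB WAL segments

def parse_page_header(page):
    """Parse the fixed part of an 8.2 XLOG page header (assumed layout):
    uint16 xlp_magic, uint16 xlp_info, uint32 xlp_tli, then the page's
    own address as XLogRecPtr (uint32 xlogid, uint32 xrecoff)."""
    magic, info, tli, xlogid, xrecoff = struct.unpack_from('<HHIII', page)
    return magic, info, tli, xlogid, xrecoff

def check_segment(path, expect_xlogid, expect_seg):
    """Return pages whose self-address disagrees with the file's name."""
    bad = []
    with open(path, 'rb') as f:
        pageno = 0
        while True:
            page = f.read(BLCKSZ)
            if len(page) < 16:
                break
            _, _, _, xlogid, xrecoff = parse_page_header(page)
            expected = expect_seg * SEG_SIZE + pageno * BLCKSZ
            if xlogid != expect_xlogid or xrecoff != expected:
                bad.append((pageno, xlogid, xrecoff))
            pageno += 1
    return bad
```

Running check_segment over 000000010000042200000058 with expect_xlogid=0x422, expect_seg=0x58 would flag page 0 (whose header claims 423/C1000000) but pass page 1, matching Tom's reading of the dump.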
regards, tom lane
On Mon, 2007-12-31 at 11:53 -0500, Tom Lane wrote:
It might be worth trawling through both files to check the page headers
(every 8K) and see which ones agree with expectation and which don't.
The state of the ...0058 file might be explained by the theory that
you'd archived it a bit too late (after the first page had been
overwritten with newer WAL data),
The interlock with .ready and .done should prevent reuse of a file. So
the only way this could happen is if the archive_command queued a
request to copy, rather than performing the copy immediately.
So I was going to say "that's not possible", but perhaps rsync might
become confused by the file renaming mechanism we use?
but the ...0059 file seems just plain
broken.
Yeh
I am starting to wonder about hardware or OS misfeasance
causing writes to be lost or misdirected.
Agreed
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com
"Mason Hale" <masonhale@gmail.com> writes:
Tom, I'll send these to you privately.
Thanks. I don't see anything particularly surprising there though.
What I was wondering about was whether your application was in the
habit of doing repeated no-op updates on the same "entry" row.
The pg_filedump outputs seem to blow away any theory of hardware-level
duplication of the row --- all the tuples on both pages have the
expected block number in their headers, so it seems PG deliberately
put them where they are. And the two tuples at issue are both marked
UPDATED, so they clearly are updated versions of some now-lost original.
What is not clear is whether they are independent updates of the same
original or whether there was a chain of updates --- that is, was the
newer one (which from the timestamp must be the one in the
lower-numbered block) made by an update from the older one, or from the
lost original?
Since the older one doesn't show any sign of having been updated itself
(in particular, no xmax and its ctid still points to itself), the former
theory would require assuming that the page update "got lost" --- was
discarded without being written to disk. On the other hand, the latter
theory seems to require a similar assumption with respect to whatever
page held the original.
Given this, and the index corruption you showed before (the wrong
sibling link, which would represent index breakage quite independent of
what was in the heap), and the curious contents of your WAL files
(likewise not explainable by anything going wrong within a table),
I'm starting to think that Occam's razor says you've got hardware
problems. Or maybe a kernel-level bug that is causing writes to get
discarded.
regards, tom lane
Simon Riggs <simon@2ndquadrant.com> writes:
On Mon, 2007-12-31 at 11:53 -0500, Tom Lane wrote:
The state of the ...0058 file might be explained by the theory that
you'd archived it a bit too late (after the first page had been
overwritten with newer WAL data),
The interlock with .ready and .done should prevent reuse of a file. So
the only way this could happen is if the archive_command queued a
request to copy, rather than performing the copy immediately.
So I was going to say "that's not possible", but perhaps rsync might
become confused by the file renaming mechanism we use?
Actually, the other problem with that theory is that the slave swallowed
the file without complaint. Since the WAL reader code does check that
the page header contains the expected address, this seems to imply that
what the slave saw must have had 422/58 in it, not the 423/C1 we see
now. So what needs to be explained is why what Mason is looking at now
is different from what the slave saw ten days ago.
regards, tom lane
On Mon, 2007-12-31 at 12:33 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
On Mon, 2007-12-31 at 11:53 -0500, Tom Lane wrote:
The state of the ...0058 file might be explained by the theory that
you'd archived it a bit too late (after the first page had been
overwritten with newer WAL data),
The interlock with .ready and .done should prevent reuse of a file. So
the only way this could happen is if the archive_command queued a
request to copy, rather than performing the copy immediately.
So I was going to say "that's not possible", but perhaps rsync might
become confused by the file renaming mechanism we use?
Actually, the other problem with that theory is that the slave swallowed
the file without complaint.
No, it barfed. Mason showed us a recovery script, so it came from the
slave.
Since the WAL reader code does check that
the page header contains the expected address, this seems to imply that
what the slave saw must have had 422/58 in it, not the 423/C1 we see
now. So what needs to be explained is why what Mason is looking at now
is different from what the slave saw ten days ago.
So the slave did see a problem ten days ago, though I take your point
that the problem we see now may not be the same as it was back then.
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com