problems with table corruption continued

Started by Brian Hirt · about 24 years ago · 24 messages
#1 Brian Hirt
bhirt@mobygames.com

Okay, here's a follow-up to my previous message "ACK table corrupted,
unique index violated."

I've been trying to clean up the corruption that I mentioned earlier. I
felt most comfortable shutting down all my application servers, restarting
postgres, doing a dump of my database, and rebuilding it with an initdb and
complete reload. So far so good. Then I went to fix one of the corrupted
tables and had another strange experience. I'm still looking into other
possibilities such as a hardware failure, but I thought this might be
interesting or helpful in the context of my previous post. Basically, the
table with the duplicate oid/id rows now has unique oids after the reload,
so I'm going to delete the duplicate rows and recreate the unique index on
the identity column.

basement=# select count(*),developer_aka_id from developer_aka group by
developer_aka_id having count(*) <> 1;
 count | developer_aka_id
-------+------------------
     2 |             9789
     2 |            10025
     2 |            40869
(3 rows)

basement=# select oid,* from developer_aka where developer_aka_id in
(9789,10025,40869);
  oid  | developer_id | developer_aka_id |    first_name     | last_name
-------+--------------+------------------+-------------------+-----------
 48390 |         1916 |             9789 | Chris             | Smith
 48402 |        35682 |            40869 | Donald "Squirral" | Fisk
 48425 |         4209 |            10025 | Mike              | Glosecki
 48426 |         1916 |             9789 | Chris             | Smith
 48427 |        35682 |            40869 | Donald "Squirral" | Fisk
 48428 |         4209 |            10025 | Mike              | Glosecki
(6 rows)

basement=# delete from developer_aka where oid in (48390,48402,48425);
DELETE 3
basement=# select count(*),developer_aka_id from developer_aka group by
developer_aka_id having count(*) <> 1;
 count | developer_aka_id
-------+------------------
(0 rows)

basement=# create unique index developer_aka_pkey on
developer_aka(developer_aka_id);
CREATE
basement=# VACUUM ANALYZE developer_aka;
ERROR: Cannot insert a duplicate key into unique index developer_aka_pkey

#2 Brian Hirt
bhirt@mobygames.com
In reply to: Brian Hirt (#1)
3 attachment(s)
Re: problems with table corruption continued

Tom & All:

I've been looking into this problem some more, and I've been able to
consistently reproduce the error. I've tested it on two different
machines, both running 7.1.3 on an i686-pc-linux-gnu configuration. One
machine is RH7.2 and the other is RH6.2.

I still haven't been able to reproduce the problem with corrupted
indexes/tables (NUMBER OF TUPLES IS NOT THE SAME AS HEAP -- duplicate rows
with the same oid & pkey), but I'm hopeful that these two problems are
related. Also, I wanted to mention that the same two tables became
corrupted on 12-15-2001, after my 12-12-2001 reload.

I've attached three files: a typescript and two sql files. I found that if
the commands were combined into a single file and run in a single psql
session, the error does not occur -- so it's important to follow the
commands exactly as they appear in the typescript. If for some reason the
errors aren't reproducible on your machines, let me know and we will try to
find out what's unique about my setup.

Thanks.

----- Original Message -----
From: "Brian Hirt" <bhirt@mobygames.com>
To: "Postgres Hackers" <pgsql-hackers@postgresql.org>
Cc: "Brian A Hirt" <bhirt@berkhirt.com>
Sent: Wednesday, December 12, 2001 11:30 AM
Subject: [HACKERS] problems with table corruption continued

Okay, here's a follow-up to my previous message "ACK table corrupted,
unique index violated."

I've been trying to clean up the corruption that I mentioned earlier. I
felt most comfortable shutting down all my application servers, restarting
postgres, doing a dump of my database, and rebuilding it with an initdb and
complete reload. So far so good. Then I went to fix one of the corrupted
tables and had another strange experience. I'm still looking into other
possibilities such as a hardware failure, but I thought this might be
interesting or helpful in the context of my previous post. Basically, the
table with the duplicate oid/id rows now has unique oids after the reload,
so I'm going to delete the duplicate rows and recreate the unique index on
the identity column.

basement=# select count(*),developer_aka_id from developer_aka group by
developer_aka_id having count(*) <> 1;
 count | developer_aka_id
-------+------------------
     2 |             9789
     2 |            10025
     2 |            40869
(3 rows)

basement=# select oid,* from developer_aka where developer_aka_id in
(9789,10025,40869);
  oid  | developer_id | developer_aka_id |    first_name     | last_name
-------+--------------+------------------+-------------------+-----------
 48390 |         1916 |             9789 | Chris             | Smith
 48402 |        35682 |            40869 | Donald "Squirral" | Fisk
 48425 |         4209 |            10025 | Mike              | Glosecki
 48426 |         1916 |             9789 | Chris             | Smith
 48427 |        35682 |            40869 | Donald "Squirral" | Fisk
 48428 |         4209 |            10025 | Mike              | Glosecki
(6 rows)

basement=# delete from developer_aka where oid in (48390,48402,48425);
DELETE 3
basement=# select count(*),developer_aka_id from developer_aka group by
developer_aka_id having count(*) <> 1;
 count | developer_aka_id
-------+------------------
(0 rows)

basement=# create unique index developer_aka_pkey on
developer_aka(developer_aka_id);
CREATE
basement=# VACUUM ANALYZE developer_aka;
ERROR: Cannot insert a duplicate key into unique index developer_aka_pkey


Attachments:

typescript (application/octet-stream)
b2.sql (application/octet-stream)
b3.sql (application/octet-stream)
#3 Brian Hirt
bhirt@mobygames.com
In reply to: Brian Hirt (#1)
Re: problems with table corruption continued

Great, I'm also trying to create a reproducible test case for the original
problem I reported with duplicate rows/oids/pkeys. Maybe both problems are
a result of the same bug; I don't know.

BTW, if you remove the index "developer_aka_search_idx", which is built on
the plpgsql function, the problem is not reproducible. But you've probably
figured that out by now.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Brian Hirt" <bhirt@mobygames.com>
Cc: "Postgres Hackers" <pgsql-hackers@postgresql.org>; "Brian A Hirt"
<bhirt@berkhirt.com>
Sent: Tuesday, December 18, 2001 10:52 AM
Subject: Re: [HACKERS] problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

I've attached three files: a typescript and two sql files. I found that if
the commands were combined into a single file and run in a single psql
session, the error does not occur -- so it's important to follow the
commands exactly as they appear in the typescript. If for some reason the
errors aren't reproducible on your machines, let me know and we will try to
find out what's unique about my setup.

Indeed, it reproduces just fine here --- not only that, but I get an
Assert failure when I try it in current sources.

Many, many thanks! I shall start digging forthwith...

regards, tom lane


#4 Brian Hirt
bhirt@mobygames.com
In reply to: Brian Hirt (#1)
Re: problems with table corruption continued

Tom,

I'm glad you found the cause. I already deleted the index a few days back,
after I strongly suspected it was related to the problem. The reason I
created the index was to help locate a record by a person's name with an
index lookup instead of a sequential scan on a table with several hundred
thousand rows.

I was trying to avoid adding additional computed fields to the tables,
maintaining them with triggers, and indexing and searching on them. The
index function seemed like an elegant solution to the problem, although at
the time I was completely unaware of 'iscachable'.

Do you think this might also explain the following errors I was seeing?

NOTICE: Cannot rename init file
/moby/pgsql/base/156130/pg_internal.init.19833 to
/moby/pgsql/base/156130/pg_internal.init: No such file or directory

--brian

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Brian Hirt" <bhirt@mobygames.com>
Cc: "Vadim Mikheev" <vmikheev@sectorbase.com>; "Postgres Hackers"
<pgsql-hackers@postgresql.org>; "Brian A Hirt" <bhirt@berkhirt.com>
Sent: Tuesday, December 18, 2001 11:48 AM
Subject: Re: [HACKERS] problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

[ example case that creates a duplicate tuple ]

Got it. The problem arises in this section of vacuum.c, near the bottom
of repair_frag:

    if (!(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED))
    {
        if ((TransactionId) tuple.t_data->t_cmin != myXID)
            elog(ERROR, "Invalid XID in t_cmin (3)");
        if (tuple.t_data->t_infomask & HEAP_MOVED_OFF)
        {
            itemid->lp_flags &= ~LP_USED;
            num_tuples++;
        }
        else
            elog(ERROR, "HEAP_MOVED_OFF was expected (2)");
    }

This is trying to get rid of the original copy of a tuple that's been
moved to another page. The problem is that your index function causes a
table scan, which means that by the time control gets here, someone else
has looked at this tuple and marked it good --- so the initial test of
HEAP_XMIN_COMMITTED fails, and the tuple is never removed!

I would say that it's incorrect for vacuum.c to assume that
HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN
tuples during the course of vacuum's processing; after all, the xmin
definitely does refer to a committed xact, and we can't realistically
assume that we know what processing will be induced by user-defined
index functions. Vadim, what do you think? How should we fix this?

In the meantime, Brian, I think you ought to get rid of your index:

    CREATE INDEX "developer_aka_search_idx" on "developer_aka" using btree
        ( developer_aka_search_name ("developer_aka_id") "varchar_ops" );

I do not think this index is well-defined: an index function ought
to have the property that its output depends solely on its input and
cannot change over time. This function cannot make that claim.
(In 7.2, you can't even create the index unless you mark the function
iscachable, which is really a lie.) I'm not even real sure what you
expect the index to do for you...

I do not know whether this effect explains all the reports of duplicate
tuples we've seen in the last few weeks. We need to ask whether the
other complainants were using index functions that tried to do table
scans.

regards, tom lane

#5 Brian Hirt
bhirt@mobygames.com
In reply to: Brian Hirt (#1)
Re: problems with table corruption continued

Yes, both tables had similar functions, and the corruption was limited to
only those two tables.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Brian Hirt" <bhirt@mobygames.com>
Cc: "Postgres Hackers" <pgsql-hackers@postgresql.org>; "Brian A Hirt"
<bhirt@berkhirt.com>
Sent: Tuesday, December 18, 2001 11:53 AM
Subject: Re: [HACKERS] problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

Great, I'm also trying to create a reproducible test case for the original
problem I reported with duplicate rows/oids/pkeys. Maybe both problems are
a result of the same bug; I don't know.

Were the duplicate rows all in tables that had functional indexes based
on functions similar to developer_aka_search_name? The problem we're
seeing here seems to be due to VACUUM not being able to cope with the
side effects of the SELECT inside the index function.

regards, tom lane

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brian Hirt (#2)
Re: problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

I've attached three files: a typescript and two sql files. I found that if
the commands were combined into a single file and run in a single psql
session, the error does not occur -- so it's important to follow the
commands exactly as they appear in the typescript. If for some reason the
errors aren't reproducible on your machines, let me know and we will try to
find out what's unique about my setup.

Indeed, it reproduces just fine here --- not only that, but I get an
Assert failure when I try it in current sources.

Many, many thanks! I shall start digging forthwith...

regards, tom lane

#7 Brian Hirt
bhirt@mobygames.com
In reply to: Brian Hirt (#1)
Re: problems with table corruption continued

Tom,

I probably could have written the function the way you suggest in this
email, and I probably will look into doing so. This was the first time I
tried creating an index function, and one of the few times I've used
plpgsql. In general I have very little experience with doing this kind of
stuff in postgres (I try to stick to standard SQL as much as possible), and
it looks like I've stumbled onto this problem because of a bad design
decision on my part and a lack of understanding of how index functions
work.

Thanks for the suggestion.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Brian Hirt" <bhirt@mobygames.com>
Cc: "Postgres Hackers" <pgsql-hackers@postgresql.org>; "Brian A Hirt"
<bhirt@berkhirt.com>
Sent: Tuesday, December 18, 2001 12:20 PM
Subject: Re: [HACKERS] problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

I was trying to avoid adding additional computed fields to the tables,
maintaining them with triggers, and indexing and searching on them. The
index function seemed like an elegant solution to the problem

Understood, but can you write the index function in a way that avoids
having it do a SELECT to get at data that it hasn't been passed? I'm
wondering if you can't define the function as just

    f(first_name, last_name) = upper(first_name || ' ' || last_name)

and create the index on f(first_name, last_name). You haven't shown us
the queries you expect the index to be helpful for, so maybe this is not
workable...

regards, tom lane

#8 Brian Hirt
bhirt@mobygames.com
In reply to: Brian Hirt (#1)
Re: problems with table corruption continued

Well, when my application starts, about 100 backend database connections
start up at the same time, so this fits the circumstances you describe.
However, I'm just using a standard ext2 filesystem on a 2.2 Linux kernel.

It's good to know that this error should be harmless.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Brian Hirt" <bhirt@mobygames.com>
Cc: "Postgres Hackers" <pgsql-hackers@postgresql.org>; "Brian A Hirt"
<bhirt@berkhirt.com>
Sent: Tuesday, December 18, 2001 12:32 PM
Subject: Re: [HACKERS] problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

Do you think this might also explain the following errors I was seeing?

NOTICE: Cannot rename init file
/moby/pgsql/base/156130/pg_internal.init.19833 to
/moby/pgsql/base/156130/pg_internal.init: No such file or directory

No, that error is not coming from anywhere near VACUUM; it's from
relcache startup (see write_irels in src/backend/utils/cache/relcache.c).
The rename source file has just been created in the very same routine,
so it's difficult to see how one would get a "No such file" failure from
rename().

It is possible that several backends create temp init files at the
same time and all try to rename their own temp files into place as
the pg_internal.init file. However, that should work: the rename
man page says

    The rename() system call causes the source file to be renamed to
    target. If target exists, it is first removed. Both source and
    target must be of the same type (that is, either directories or
    nondirectories), and must reside on the same file system.

    If target can be created or if it existed before the call, rename()
    guarantees that an instance of target will exist, even if the system
    crashes in the midst of the operation.

so we should end up with the extra copies deleted and just one init file
remaining after the slowest backend renames its copy into place.

Do you by chance have your database directory mounted via NFS, or some
other strange filesystem where the normal semantics of concurrent rename
might not work quite right?

FWIW, I believe this condition is pretty harmless, but it would be nice
to understand exactly why you're seeing the message.

regards, tom lane

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brian Hirt (#2)
Re: problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

[ example case that creates a duplicate tuple ]

Got it. The problem arises in this section of vacuum.c, near the bottom
of repair_frag:

    if (!(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED))
    {
        if ((TransactionId) tuple.t_data->t_cmin != myXID)
            elog(ERROR, "Invalid XID in t_cmin (3)");
        if (tuple.t_data->t_infomask & HEAP_MOVED_OFF)
        {
            itemid->lp_flags &= ~LP_USED;
            num_tuples++;
        }
        else
            elog(ERROR, "HEAP_MOVED_OFF was expected (2)");
    }

This is trying to get rid of the original copy of a tuple that's been
moved to another page. The problem is that your index function causes a
table scan, which means that by the time control gets here, someone else
has looked at this tuple and marked it good --- so the initial test of
HEAP_XMIN_COMMITTED fails, and the tuple is never removed!

I would say that it's incorrect for vacuum.c to assume that
HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN
tuples during the course of vacuum's processing; after all, the xmin
definitely does refer to a committed xact, and we can't realistically
assume that we know what processing will be induced by user-defined
index functions. Vadim, what do you think? How should we fix this?

In the meantime, Brian, I think you ought to get rid of your index:

    CREATE INDEX "developer_aka_search_idx" on "developer_aka" using btree
        ( developer_aka_search_name ("developer_aka_id") "varchar_ops" );

I do not think this index is well-defined: an index function ought
to have the property that its output depends solely on its input and
cannot change over time. This function cannot make that claim.
(In 7.2, you can't even create the index unless you mark the function
iscachable, which is really a lie.) I'm not even real sure what you
expect the index to do for you...
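
[The attached b2.sql/b3.sql are not reproduced in this archive, but the
pattern under discussion looks roughly like the sketch below; the
function body is a guess. The essential point is that the index function
runs its own SELECT against developer_aka, so evaluating it while VACUUM
is moving tuples re-scans the very table being vacuumed:

    CREATE FUNCTION developer_aka_search_name (integer)
    RETURNS varchar AS '
    DECLARE
        result varchar;
    BEGIN
        -- looks up the row by id and derives a searchable name from it
        SELECT INTO result upper(first_name || '' '' || last_name)
          FROM developer_aka
         WHERE developer_aka_id = $1;
        RETURN result;
    END;
    ' LANGUAGE 'plpgsql';

    CREATE INDEX developer_aka_search_idx ON developer_aka
        ( developer_aka_search_name (developer_aka_id) varchar_ops );
]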

I do not know whether this effect explains all the reports of duplicate
tuples we've seen in the last few weeks. We need to ask whether the
other complainants were using index functions that tried to do table
scans.

regards, tom lane

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brian Hirt (#3)
Re: problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

Great, I'm also trying to create a reproducible test case for the original
problem I reported with duplicate rows/oids/pkeys. Maybe both problems are
a result of the same bug; I don't know.

Were the duplicate rows all in tables that had functional indexes based
on functions similar to developer_aka_search_name? The problem we're
seeing here seems to be due to VACUUM not being able to cope with the
side effects of the SELECT inside the index function.

regards, tom lane

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brian Hirt (#4)
Re: problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

I was trying to avoid adding additional computed fields to the tables and
maintaining them with triggers, indexing and searching on them. The index
function seemed like an elegant solution to the problem

Understood, but can you write the index function in a way that avoids
having it do a SELECT to get at data that it hasn't been passed? I'm
wondering if you can't define the function as just

    f(first_name, last_name) = upper(first_name || ' ' || last_name)

and create the index on f(first_name, last_name). You haven't shown us
the queries you expect the index to be helpful for, so maybe this is not
workable...
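
[Sketched concretely, with illustrative names, the suggestion is a
function that depends only on its arguments -- so it can honestly be
marked iscachable -- plus an index on that function:

    CREATE FUNCTION developer_full_name (varchar, varchar)
    RETURNS varchar AS '
        -- derived purely from the two arguments; no table access
        SELECT upper($1 || '' '' || $2);
    ' LANGUAGE 'sql' WITH (iscachable);

    CREATE INDEX developer_aka_name_idx ON developer_aka
        ( developer_full_name (first_name, last_name) );

A query of the form

    SELECT * FROM developer_aka
    WHERE developer_full_name(first_name, last_name) = upper('Chris Smith');

could then use the index.]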

regards, tom lane

#12 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#11)
Re: problems with table corruption continued

This is trying to get rid of the original copy of a tuple that's been
moved to another page. The problem is that your index function causes a
table scan, which means that by the time control gets here, someone else
has looked at this tuple and marked it good --- so the initial test of
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
HEAP_XMIN_COMMITTED fails, and the tuple is never removed!

I would say that it's incorrect for vacuum.c to assume that
HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN
tuples during the course of vacuum's processing; after all, the xmin
definitely does refer to a committed xact, and we can't realistically
assume that we know what processing will be induced by user-defined
index functions. Vadim, what do you think? How should we fix this?

But it's also incorrect for a table scan to mark the tuple as good.
Looks like we have to add checks for the case
TransactionIdIsCurrentTransactionId(tuple->t_cmin), when
HEAP_MOVED_OFF or HEAP_MOVED_IN is set in t_infomask, to
all the HeapTupleSatisfies* routines in tqual.c, as we do in
HeapTupleSatisfiesDirty -- note the comments about unique btrees there.

Vadim

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Brian Hirt (#4)
Re: problems with table corruption continued

"Brian Hirt" <bhirt@mobygames.com> writes:

Do you think this might also explain the following errors I was seeing?

NOTICE: Cannot rename init file
/moby/pgsql/base/156130/pg_internal.init.19833 to
/moby/pgsql/base/156130/pg_internal.init: No such file or directory

No, that error is not coming from anywhere near VACUUM; it's from
relcache startup (see write_irels in src/backend/utils/cache/relcache.c).
The rename source file has just been created in the very same routine,
so it's difficult to see how one would get a "No such file" failure from
rename().

It is possible that several backends create temp init files at the
same time and all try to rename their own temp files into place as
the pg_internal.init file. However, that should work: the rename
man page says

    The rename() system call causes the source file to be renamed to
    target. If target exists, it is first removed. Both source and
    target must be of the same type (that is, either directories or
    nondirectories), and must reside on the same file system.

    If target can be created or if it existed before the call, rename()
    guarantees that an instance of target will exist, even if the system
    crashes in the midst of the operation.

so we should end up with the extra copies deleted and just one init file
remaining after the slowest backend renames its copy into place.

Do you by chance have your database directory mounted via NFS, or some
other strange filesystem where the normal semantics of concurrent rename
might not work quite right?

FWIW, I believe this condition is pretty harmless, but it would be nice
to understand exactly why you're seeing the message.

regards, tom lane

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#12)
Re: problems with table corruption continued

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

I would say that it's incorrect for vacuum.c to assume that
HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN
tuples during the course of vacuum's processing; after all, the xmin
definitely does refer to a committed xact, and we can't realistically
assume that we know what processing will be induced by user-defined
index functions. Vadim, what do you think? How should we fix this?

But it's also incorrect for a table scan to mark the tuple as good.

Oh, that makes sense.

Looks like we have to add checks for the case
TransactionIdIsCurrentTransactionId(tuple->t_cmin), when
HEAP_MOVED_OFF or HEAP_MOVED_IN is set in t_infomask, to
all the HeapTupleSatisfies* routines in tqual.c, as we do in
HeapTupleSatisfiesDirty -- note the comments about unique btrees there.

Sounds like a plan. Do you want to work on this, or shall I?

regards, tom lane

#15 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#14)
Re: problems with table corruption continued

Looks like we have to add checks for the case
TransactionIdIsCurrentTransactionId(tuple->t_cmin), when
HEAP_MOVED_OFF or HEAP_MOVED_IN is set in t_infomask, to
all the HeapTupleSatisfies* routines in tqual.c, as we do in
HeapTupleSatisfiesDirty -- note the comments about unique btrees there.

Sounds like a plan. Do you want to work on this, or shall I?

Sorry, I'm not able to do this fast enough for the current pre-release stage -:(

Vadim

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#15)
Re: problems with table corruption continued

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

Sounds like a plan. Do you want to work on this, or shall I?

Sorry, I'm not able to do this fast enough for the current pre-release stage -:(

Okay. I'll work on a patch and send it to you for review, if you like.

regards, tom lane

#17 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#14)
Re: problems with table corruption continued

-----Original Message-----
From: Tom Lane

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

I would say that it's incorrect for vacuum.c to assume that
HEAP_XMIN_COMMITTED can't become set on HEAP_MOVED_OFF/HEAP_MOVED_IN
tuples during the course of vacuum's processing; after all, the xmin
definitely does refer to a committed xact, and we can't realistically
assume that we know what processing will be induced by user-defined
index functions. Vadim, what do you think? How should we fix this?

But it's also incorrect for a table scan to mark the tuple as good.

Oh, that makes sense.

Looks like we have to add checks for the case
TransactionIdIsCurrentTransactionId(tuple->t_cmin), when
HEAP_MOVED_OFF or HEAP_MOVED_IN is set in t_infomask, to
all the HeapTupleSatisfies* routines in tqual.c, as we do in
HeapTupleSatisfiesDirty -- note the comments about unique btrees there.

Should the change be TransactionIdIsInProgress(tuple->t_cmin)?

The cause of Brian's issue was exactly what I was afraid of.
VACUUM is guarded by an AccessExclusive lock, but IMHO we
shouldn't rely on that too heavily.

regards,
Hiroshi Inoue

#18 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#17)
Re: problems with table corruption continued

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Should the change be TransactionIdIsInProgress(tuple->t_cmin)?

I'd be willing to do

    if (tuple->t_cmin is my XID)
        do something;
    Assert(!TransactionIdIsInProgress(tuple->t_cmin));

if that makes you feel better. But anything that's scanning
a table exclusive-locked by another transaction is broken.
(BTW: contrib/pgstattuple is thus broken. Will fix.)

regards, tom lane

#19 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#17)
Re: problems with table corruption continued

Tom Lane wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Should the change be TransactionIdIsInProgress(tuple->t_cmin)?

I'd be willing to do

    if (tuple->t_cmin is my XID)
        do something;
    Assert(!TransactionIdIsInProgress(tuple->t_cmin));

if that makes you feel better.

That's useless in hard-to-reproduce cases.
I've thought about proposing this change many times but gave up,
and my decision seems to have been right, because even after this
issue all I see is an *Assert*.

But anything that's scanning
a table exclusive-locked by another transaction is broken.
(BTW: contrib/pgstattuple is thus broken. Will fix.)

It seems dangerous to me to rely only on developers'
carefulness. There are some places where relations
are opened with NoLock; we had better change them.
I once examined them, but AFAIR there are few cases
where they are really opened with NoLock. In most
cases they have already been opened previously with
other lock modes. I don't remember well whether there
was a really dangerous place (one that scans a relation
opened with NoLock).

regards,
Hiroshi Inoue

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#19)
Re: problems with table corruption continued

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

I don't remember well whether there was a really dangerous place (one
that scans a relation opened with NoLock).

I have just looked for that, and the only such place is in the new
contrib module pgstattuple. This is clearly broken, since there is
nothing stopping someone from (eg) dropping the relation while
pgsstattuple is scanning it. I think it should get AccessShareLock
on the relation.

The ri_triggers code has a lot of places that open things NoLock,
but it only looks into the relcache entry and doesn't try to scan
the relation. Nonetheless that code bothers me; we could be using
an obsolete relcache entry if someone has just committed an ALTER
TABLE on the relation. Some of the cases may be safe because a lock
is held higher up (eg, on the table from which the trigger was fired)
but I doubt they all are.

regards, tom lane

#21 Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Tom Lane (#20)
Re: problems with table corruption continued

On Tue, 18 Dec 2001, Tom Lane wrote:

The ri_triggers code has a lot of places that open things NoLock,
but it only looks into the relcache entry and doesn't try to scan
the relation. Nonetheless that code bothers me; we could be using
an obsolete relcache entry if someone has just committed an ALTER
TABLE on the relation. Some of the cases may be safe because a lock
is held higher up (eg, on the table from which the trigger was fired)
but I doubt they all are.

Probably not, since it looks like that's being done for the other table of
the constraint (not the one on which the trigger was fired).

#22 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Stephan Szabo (#21)
Re: problems with table corruption continued

Stephan Szabo wrote:

On Tue, 18 Dec 2001, Tom Lane wrote:

The ri_triggers code has a lot of places that open things NoLock,
but it only looks into the relcache entry and doesn't try to scan
the relation. Nonetheless that code bothers me; we could be using
an obsolete relcache entry if someone has just committed an ALTER
TABLE on the relation. Some of the cases may be safe because a lock
is held higher up (eg, on the table from which the trigger was fired)
but I doubt they all are.

Probably not, since it looks like that's being done for the other table of
the constraint (not the one on which the trigger was fired).

If a lock is held already, acquiring an AccessShareLock
would cause no additional conflict. I don't see any reason
to walk a tightrope with NoLock intentionally.

regards,
Hiroshi Inoue

#23 Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Hiroshi Inoue (#22)
Re: problems with table corruption continued

On Wed, 19 Dec 2001, Hiroshi Inoue wrote:

Stephan Szabo wrote:

On Tue, 18 Dec 2001, Tom Lane wrote:

The ri_triggers code has a lot of places that open things NoLock,
but it only looks into the relcache entry and doesn't try to scan
the relation. Nonetheless that code bothers me; we could be using
an obsolete relcache entry if someone has just committed an ALTER
TABLE on the relation. Some of the cases may be safe because a lock
is held higher up (eg, on the table from which the trigger was fired)
but I doubt they all are.

Probably not, since it looks like that's being done for the other table of
the constraint (not the one on which the trigger was fired).

If a lock is held already, acquiring an AccessShareLock
would cause no additional conflict. I don't see any reason
to walk a tightrope with NoLock intentionally.

I don't know why NoLock was used there; I was just pointing out that the
odds of a lock being held higher up are probably low.

#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#16)
Re: problems with table corruption continued

Okay, I've applied the attached patch to tqual.c. This brings all the
variants of HeapTupleSatisfies up to speed: I have compared them
carefully and they now all make equivalent series of tests on the tuple
status (though they may interpret the results differently).

I decided that Hiroshi had a good point about testing
TransactionIdIsInProgress, so the code now does that before risking
a change to t_infomask, even in the VACUUM cases.

regards, tom lane

*** src/backend/utils/time/tqual.c.orig	Mon Oct 29 15:31:08 2001
--- src/backend/utils/time/tqual.c	Wed Dec 19 12:09:32 2001
***************
*** 31,37 ****
  /*
   * HeapTupleSatisfiesItself
   *		True iff heap tuple is valid for "itself."
!  *		"{it}self" means valid as of everything that's happened
   *		in the current transaction, _including_ the current command.
   *
   * Note:
--- 31,37 ----
  /*
   * HeapTupleSatisfiesItself
   *		True iff heap tuple is valid for "itself."
!  *		"itself" means valid as of everything that's happened
   *		in the current transaction, _including_ the current command.
   *
   * Note:
***************
*** 53,59 ****
  bool
  HeapTupleSatisfiesItself(HeapTupleHeader tuple)
  {
- 
  	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
  	{
  		if (tuple->t_infomask & HEAP_XMIN_INVALID)
--- 53,58 ----
***************
*** 61,86 ****
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
  		{
  			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
  				return true;
  			if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return true;
  			return false;
  		}
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
--- 60,102 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return false;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return false;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
  		{
  			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
  				return true;
+ 
+ 			Assert(TransactionIdIsCurrentTransactionId(tuple->t_xmax));
+ 
  			if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return true;
+ 
  			return false;
  		}
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
***************
*** 89,97 ****
  				tuple->t_infomask |= HEAP_XMIN_INVALID; /* aborted */
  			return false;
  		}
! 		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
! 	/* the tuple was inserted validly */
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return true;
--- 105,115 ----
  				tuple->t_infomask |= HEAP_XMIN_INVALID; /* aborted */
  			return false;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
! 
! 	/* by here, the inserting transaction has committed */

  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return true;
***************
*** 117,123 ****
  		return true;
  	}
  
! 	/* by here, deleting transaction has committed */
  	tuple->t_infomask |= HEAP_XMAX_COMMITTED;

  	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
--- 135,141 ----
  		return true;
  	}

! 	/* xmax transaction committed */
  	tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  
  	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
***************
*** 177,194 ****

  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
--- 195,225 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return false;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return false;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
***************
*** 215,221 ****
  				tuple->t_infomask |= HEAP_XMIN_INVALID; /* aborted */
  			return false;
  		}
! 		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  	/* by here, the inserting transaction has committed */
--- 246,253 ----
  				tuple->t_infomask |= HEAP_XMIN_INVALID; /* aborted */
  			return false;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}

  	/* by here, the inserting transaction has committed */
***************
*** 257,348 ****
  }
  
  int
! HeapTupleSatisfiesUpdate(HeapTuple tuple)
  {
! 	HeapTupleHeader th = tuple->t_data;
  
  	if (AMI_OVERRIDE)
  		return HeapTupleMayBeUpdated;
  
! 	if (!(th->t_infomask & HEAP_XMIN_COMMITTED))
  	{
! 		if (th->t_infomask & HEAP_XMIN_INVALID)		/* xid invalid or aborted */
  			return HeapTupleInvisible;
  
! 		if (th->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) th->t_cmin))
! 			{
! 				th->t_infomask |= HEAP_XMIN_INVALID;
  				return HeapTupleInvisible;
  			}
  		}
! 		else if (th->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) th->t_cmin))
  			{
! 				th->t_infomask |= HEAP_XMIN_INVALID;
! 				return HeapTupleInvisible;
  			}
  		}
! 		else if (TransactionIdIsCurrentTransactionId(th->t_xmin))
  		{
! 			if (CommandIdGEScanCommandId(th->t_cmin))
  				return HeapTupleInvisible;		/* inserted after scan
  												 * started */
  
! 			if (th->t_infomask & HEAP_XMAX_INVALID)		/* xid invalid */
  				return HeapTupleMayBeUpdated;
  
! 			Assert(TransactionIdIsCurrentTransactionId(th->t_xmax));
  
! 			if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return HeapTupleMayBeUpdated;
  
! 			if (CommandIdGEScanCommandId(th->t_cmax))
  				return HeapTupleSelfUpdated;	/* updated after scan
  												 * started */
  			else
  				return HeapTupleInvisible;		/* updated before scan
  												 * started */
  		}
! 		else if (!TransactionIdDidCommit(th->t_xmin))
  		{
! 			if (TransactionIdDidAbort(th->t_xmin))
! 				th->t_infomask |= HEAP_XMIN_INVALID;	/* aborted */
  			return HeapTupleInvisible;
  		}
! 		th->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/* by here, the inserting transaction has committed */
  
! 	if (th->t_infomask & HEAP_XMAX_INVALID)		/* xid invalid or aborted */
  		return HeapTupleMayBeUpdated;
  
! 	if (th->t_infomask & HEAP_XMAX_COMMITTED)
  	{
! 		if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
  		return HeapTupleUpdated;	/* updated by other */
  	}
  
! 	if (TransactionIdIsCurrentTransactionId(th->t_xmax))
  	{
! 		if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
! 		if (CommandIdGEScanCommandId(th->t_cmax))
  			return HeapTupleSelfUpdated;	/* updated after scan
  											 * started */
  		else
  			return HeapTupleInvisible;	/* updated before scan started */
  	}

! 	if (!TransactionIdDidCommit(th->t_xmax))
  	{
! 		if (TransactionIdDidAbort(th->t_xmax))
  		{
! 			th->t_infomask |= HEAP_XMAX_INVALID;		/* aborted */
  			return HeapTupleMayBeUpdated;
  		}
  		/* running xact */
--- 289,394 ----
  }

  int
! HeapTupleSatisfiesUpdate(HeapTuple htuple)
  {
! 	HeapTupleHeader tuple = htuple->t_data;
  
  	if (AMI_OVERRIDE)
  		return HeapTupleMayBeUpdated;
  
! 	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
  	{
! 		if (tuple->t_infomask & HEAP_XMIN_INVALID)
  			return HeapTupleInvisible;

! 		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return HeapTupleInvisible;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return HeapTupleInvisible;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
! 		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return HeapTupleInvisible;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return HeapTupleInvisible;
! 				}
  			}
  		}
! 		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
  		{
! 			if (CommandIdGEScanCommandId(tuple->t_cmin))
  				return HeapTupleInvisible;		/* inserted after scan
  												 * started */

! 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
  				return HeapTupleMayBeUpdated;
  
! 			Assert(TransactionIdIsCurrentTransactionId(tuple->t_xmax));
  
! 			if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  				return HeapTupleMayBeUpdated;
  
! 			if (CommandIdGEScanCommandId(tuple->t_cmax))
  				return HeapTupleSelfUpdated;	/* updated after scan
  												 * started */
  			else
  				return HeapTupleInvisible;		/* updated before scan
  												 * started */
  		}
! 		else if (!TransactionIdDidCommit(tuple->t_xmin))
  		{
! 			if (TransactionIdDidAbort(tuple->t_xmin))
! 				tuple->t_infomask |= HEAP_XMIN_INVALID; /* aborted */
  			return HeapTupleInvisible;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  
  	/* by here, the inserting transaction has committed */
  
! 	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
  		return HeapTupleMayBeUpdated;
  
! 	if (tuple->t_infomask & HEAP_XMAX_COMMITTED)
  	{
! 		if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
  		return HeapTupleUpdated;	/* updated by other */
  	}
  
! 	if (TransactionIdIsCurrentTransactionId(tuple->t_xmax))
  	{
! 		if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  			return HeapTupleMayBeUpdated;
! 		if (CommandIdGEScanCommandId(tuple->t_cmax))
  			return HeapTupleSelfUpdated;	/* updated after scan
  											 * started */
  		else
  			return HeapTupleInvisible;	/* updated before scan started */
  	}
  
! 	if (!TransactionIdDidCommit(tuple->t_xmax))
  	{
! 		if (TransactionIdDidAbort(tuple->t_xmax))
  		{
! 			tuple->t_infomask |= HEAP_XMAX_INVALID;		/* aborted */
  			return HeapTupleMayBeUpdated;
  		}
  		/* running xact */
***************
*** 350,358 ****
  	}
  
  	/* xmax transaction committed */
! 	th->t_infomask |= HEAP_XMAX_COMMITTED;
  
! 	if (th->t_infomask & HEAP_MARKED_FOR_UPDATE)
  		return HeapTupleMayBeUpdated;

  	return HeapTupleUpdated;	/* updated by other */
--- 396,404 ----
  	}

  	/* xmax transaction committed */
! 	tuple->t_infomask |= HEAP_XMAX_COMMITTED;
  
! 	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
  		return HeapTupleMayBeUpdated;
  
  	return HeapTupleUpdated;	/* updated by other */
***************
*** 374,396 ****

  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
- 			/*
- 			 * HeapTupleSatisfiesDirty is used by unique btree-s and so
- 			 * may be used while vacuuming.
- 			 */
  			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
- 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
  			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
  				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  				else
--- 420,443 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
  			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
! 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
! 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
  			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
+ 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 					return false;
  				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  				else
***************
*** 416,425 ****
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID; /* aborted */
  				return false;
  			}
  			SnapshotDirty->xmin = tuple->t_xmin;
  			return true;		/* in insertion by other */
  		}
  		else
--- 463,473 ----
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  			SnapshotDirty->xmin = tuple->t_xmin;
+ 			/* XXX shouldn't we fall through to look at xmax? */
  			return true;		/* in insertion by other */
  		}
  		else
***************
*** 474,479 ****
--- 522,528 ----
  	if (AMI_OVERRIDE)
  		return true;

+ 	/* XXX this is horribly ugly: */
  	if (ReferentialIntegritySnapshotOverride)
  		return HeapTupleSatisfiesNow(tuple);

***************
*** 484,501 ****

  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return false;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
! 				return false;
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
--- 533,563 ----
  		if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  				return false;
+ 			if (!TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 			{
+ 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
+ 				{
+ 					tuple->t_infomask |= HEAP_XMIN_INVALID;
+ 					return false;
+ 				}
+ 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  			}
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
  			{
! 				if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 					return false;
! 				if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 					tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 				else
! 				{
! 					tuple->t_infomask |= HEAP_XMIN_INVALID;
! 					return false;
! 				}
  			}
  		}
  		else if (TransactionIdIsCurrentTransactionId(tuple->t_xmin))
***************
*** 519,528 ****
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
! 				tuple->t_infomask |= HEAP_XMIN_INVALID; /* aborted */
  			return false;
  		}
! 		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  	/*
--- 581,591 ----
  		else if (!TransactionIdDidCommit(tuple->t_xmin))
  		{
  			if (TransactionIdDidAbort(tuple->t_xmin))
! 				tuple->t_infomask |= HEAP_XMIN_INVALID;
  			return false;
  		}
! 		else
! 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  	}
  	/*
***************
*** 623,646 ****
  			return HEAPTUPLE_DEAD;
  		else if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
  			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
  				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return HEAPTUPLE_DEAD;
  			}
- 			/* Assume we can only get here if previous VACUUM aborted, */
- 			/* ie, it couldn't still be in progress */
  			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (!TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
- 				/* Assume we can only get here if previous VACUUM aborted */
  				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return HEAPTUPLE_DEAD;
  			}
- 			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  		}
  		else if (TransactionIdIsInProgress(tuple->t_xmin))
  			return HEAPTUPLE_INSERT_IN_PROGRESS;
--- 686,715 ----
  			return HEAPTUPLE_DEAD;
  		else if (tuple->t_infomask & HEAP_MOVED_OFF)
  		{
+ 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
+ 				return HEAPTUPLE_DELETE_IN_PROGRESS;
+ 			if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
+ 				return HEAPTUPLE_DELETE_IN_PROGRESS;
  			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
  			{
  				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return HEAPTUPLE_DEAD;
  			}
  			tuple->t_infomask |= HEAP_XMIN_COMMITTED;
  		}
  		else if (tuple->t_infomask & HEAP_MOVED_IN)
  		{
! 			if (TransactionIdIsCurrentTransactionId((TransactionId) tuple->t_cmin))
! 				return HEAPTUPLE_INSERT_IN_PROGRESS;
! 			if (TransactionIdIsInProgress((TransactionId) tuple->t_cmin))
! 				return HEAPTUPLE_INSERT_IN_PROGRESS;
! 			if (TransactionIdDidCommit((TransactionId) tuple->t_cmin))
! 				tuple->t_infomask |= HEAP_XMIN_COMMITTED;
! 			else
  			{
  				tuple->t_infomask |= HEAP_XMIN_INVALID;
  				return HEAPTUPLE_DEAD;
  			}
  		}
  		else if (TransactionIdIsInProgress(tuple->t_xmin))
  			return HEAPTUPLE_INSERT_IN_PROGRESS;
***************
*** 671,676 ****
--- 740,751 ----
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)
  		return HEAPTUPLE_LIVE;
+ 	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
+ 	{
+ 		/* "deleting" xact really only marked it for update */
+ 		return HEAPTUPLE_LIVE;
+ 	}
+ 
  	if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
  	{
  		if (TransactionIdIsInProgress(tuple->t_xmax))
***************
*** 698,709 ****
  	/*
  	 * Deleter committed, but check special cases.
  	 */
- 
- 	if (tuple->t_infomask & HEAP_MARKED_FOR_UPDATE)
- 	{
- 		/* "deleting" xact really only marked it for update */
- 		return HEAPTUPLE_LIVE;
- 	}
  	if (TransactionIdEquals(tuple->t_xmin, tuple->t_xmax))
  	{
--- 773,778 ----