Possible bug in vacuum redo

Started by Tom Lane about 24 years ago; 15 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

In the vacuum tuple-chain moving logic, shouldn't the lines that update
the new tuple's t_ctid (vacuum.c lines 1882-1891 in current sources)
be moved up to before the log_heap_move call at line 1866?

It appears to me that as the code stands, log_heap_move will log the new
tuple containing the wrong t_ctid; therefore, if we crash and have to
redo the transaction from WAL, the wrong t_ctid will be restored. No?
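
(Schematically, the ordering I'm worried about looks like this; just a
sketch with simplified names and arguments, not the actual vacuum.c code:)

/* as the code stands (sketch): */
log_heap_move(rel, old_buf, old_tid, new_buf, &newtup);  /* new tuple image logged here ... */
/* ... and only some twenty lines later ... */
newtup.t_data->t_ctid = chain_tid;                       /* ... is its t_ctid filled in */

/* what I'm suggesting: */
newtup.t_data->t_ctid = chain_tid;                       /* set the link first ... */
log_heap_move(rel, old_buf, old_tid, new_buf, &newtup);  /* ... then write the WAL record */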

regards, tom lane

#2 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#1)
Re: Possible bug in vacuum redo

-----Original Message-----
From: Tom Lane

> In the vacuum tuple-chain moving logic, shouldn't the lines that update
> the new tuple's t_ctid (vacuum.c lines 1882-1891 in current sources)
> be moved up to before the log_heap_move call at line 1866?
>
> It appears to me that as the code stands, log_heap_move will log the new
> tuple containing the wrong t_ctid; therefore, if we crash and have to
> redo the transaction from WAL, the wrong t_ctid will be restored. No?

AFAIR t_ctid isn't logged in WAL.

regards,
Hiroshi Inoue

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#2)
Re: Possible bug in vacuum redo

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

> AFAIR t_ctid isn't logged in WAL.

After looking at the heap_update code I think you are right. Doesn't
that render the field completely useless/unreliable?

In the simple heap_update case I think that heap_xlog_update could
easily set the old tuple's t_ctid field correctly. Not sure how
it works when VACUUM is moving tuple chains around, however.
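
(Something along these lines, I mean; a sketch with simplified names,
not the actual heap_xlog_update code:)

/* During redo of a plain UPDATE: once the new tuple version has been
 * put back at its logged location (new_tid), restore the old version's
 * forward link, which the WAL record itself does not carry: */
old_htup->t_ctid = new_tid;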

Another thing I am currently looking at is that I do not believe VACUUM
handles tuple chain moves correctly. It only enters the chain-moving
logic if it finds a tuple that is in the *middle* of an update chain,
ie, both the prior and next tuples still exist. In the case of a
two-element update chain (only the latest and next-to-latest tuples of
a row survive VACUUM), AFAICT vacuum will happily move the latest tuple
without ever updating the previous tuple's t_ctid.

In short t_ctid seems extremely unreliable. I have been trying to work
out a way that a bad t_ctid link could lead to the duplicate-tuple
reports we've been hearing lately, but so far I haven't seen one. I do
think it can lead to missed UPDATEs in read-committed mode, however.

regards, tom lane

#4 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#3)
Re: Possible bug in vacuum redo

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> AFAIR t_ctid isn't logged in WAL.
>
> After looking at the heap_update code I think you are right. Doesn't
> that render the field completely useless/unreliable?

Redo runs with no concurrent backends. New backends started after a
redo operation don't need (to see) the existing t_ctid values.
PostgreSQL before MVCC didn't need t_ctid at all.

> In the simple heap_update case I think that heap_xlog_update could
> easily set the old tuple's t_ctid field correctly. Not sure how
> it works when VACUUM is moving tuple chains around, however.

> Another thing I am currently looking at is that I do not believe VACUUM
> handles tuple chain moves correctly. It only enters the chain-moving
> logic if it finds a tuple that is in the *middle* of an update chain,
> ie, both the prior and next tuples still exist.

      ^^^^
Isn't it *either*, not *both*?

Anyway, I agree with you that the tuple chain-moving logic is too
complex. It's one of the main reasons why I prefer the new VACUUM.

regards,
Hiroshi Inoue

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#4)
Re: Possible bug in vacuum redo

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

>> Another thing I am currently looking at is that I do not believe VACUUM
>> handles tuple chain moves correctly. It only enters the chain-moving
>> logic if it finds a tuple that is in the *middle* of an update chain,
>> ie, both the prior and next tuples still exist.
>
>      ^^^^
> Isn't it *either*, not *both*?

[ reads it again ] Oh, you're right.

Still, if WAL isn't taking care to maintain t_ctid then we have a
problem.

regards, tom lane

#6 Hannu Krosing
hannu@tm.ee
In reply to: Hiroshi Inoue (#4)
Re: Possible bug in vacuum redo

FWIW, I got the following deadlock-like situation by simply running
pgbench:

968 ? S 0:00 /usr/bin/postmaster -D /var/lib/pgsql/data
13245 pts/1 S 0:00 su - postsql
31671 pts/1 S 0:00 bin/postmaster
31672 pts/1 S 0:00 postgres: stats buffer process
31674 pts/1 S 0:00 postgres: stats collector process
32606 pts/1 S 0:02 pgbench -p 7204 -t 128 -c 32 postsql
32613 pts/1 S 0:01 postgres: postsql postsql [local] UPDATE
32615 pts/1 S 0:00 postgres: postsql postsql [local] UPDATE
32616 pts/1 S 0:01 postgres: postsql postsql [local] UPDATE
32617 pts/1 S 0:00 postgres: postsql postsql [local] UPDATE
32618 pts/1 S 0:00 postgres: postsql postsql [local] UPDATE
32623 pts/1 S 0:01 postgres: postsql postsql [local] UPDATE
32624 pts/1 S 0:01 postgres: postsql postsql [local] idle in transaction
32628 pts/1 S 0:01 postgres: postsql postsql [local] UPDATE
32633 pts/1 S 0:00 postgres: postsql postsql [local] idle
32634 pts/1 S 0:00 postgres: postsql postsql [local] UPDATE
32635 pts/1 S 0:00 postgres: postsql postsql [local] COMMIT
32636 pts/1 S 0:01 postgres: postsql postsql [local] UPDATE waiting
32637 pts/1 S 0:00 postgres: postsql postsql [local] UPDATE
32638 pts/1 S 0:00 postgres: postsql postsql [local] UPDATE
32639 pts/1 S 0:00 postgres: postsql postsql [local] UPDATE
32640 pts/1 S 0:00 postgres: checkpoint subprocess
32692 pts/2 S 0:00 su - postsql
32752 pts/2 S 0:00 grep post

It is probably some bug in pgbench, but could also be something more
serious.

I was running pgbench using Tatsuo's bench script, modified to run
vacuum full between runs, on a uniprocessor Athlon 850:

[postsql@taru bench]$ cat bench.sh
#! /bin/sh
for i in 1 2 3 4 5 6 7 8
do
#   pg_ctl stop
#   pg_ctl -w -o '-c "wal_sync_method=fdatasync" -S' start
#   pg_ctl -w start
    sh mpgbench | tee bench.res.$i |
        sh extract.sh > bench.data.$i
done
[postsql@taru bench]$ cat mpgbench
#! /bin/sh
DB=postsql
PORT=7204

#pgbench -p $PORT -i -s 128 $DB
#pgbench -p $PORT -i -s 10 $DB
#pgbench -p $PORT -i -s 1 $DB

for i in 1 2 4 8 16 32 64 128
do
#   t=$(echo "scale=0; 512/$i" | bc -l)
    t=$(echo "scale=0; 4096/$i" | bc -l)
    echo $i concurrent users... 1>&2
    echo $t transactions each... 1>&2
    pgbench -p $PORT -t $t -c $i $DB
    psql -p $PORT -c 'vacuum full' $DB
    psql -p $PORT -c 'checkpoint' $DB
    echo "===== sync ======" 1>&2
    sync;sync;sync;sleep 10
    echo "===== sync done ======" 1>&2
done

[postsql@taru bench]$ cat extract.sh
#! /bin/sh
sed -n -e '/^number of clients.*/p' \
       -e '/.*excluding connections establishing.*/p' |
sed -e 's/number of clients: //' \
    -e 's/^tps = //' \
    -e 's/(excluding connections establishing)//' |
while read i
do
    echo -n "$i "
    read j
    echo $j
done

-------------------
Hannu

#7 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#4)
Re: Possible bug in vacuum redo

Tom Lane wrote:

> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>
>>> Another thing I am currently looking at is that I do not believe VACUUM
>>> handles tuple chain moves correctly. It only enters the chain-moving
>>> logic if it finds a tuple that is in the *middle* of an update chain,
>>> ie, both the prior and next tuples still exist.
>>
>>      ^^^^
>> Isn't it *either*, not *both*?
>
> [ reads it again ] Oh, you're right.
>
> Still, if WAL isn't taking care to maintain t_ctid then we have a
> problem.

I don't think it's preferable either. However, there's
no problem unless there's an application which handles
tuples containing a t_ctid link. I know of only a few
applications (vacuum, create index??) which handle
tuples that were already updated to newer versions and
committed before their transaction started.
Note that redo is executed alone. Shutdown recovery
is called after killing all backends, if necessary.
Of course there are no other backends running when
startup recovery is called.

regards,
Hiroshi Inoue

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#7)
Re: Possible bug in vacuum redo

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

> I don't think it's preferable either. However, there's
> no problem unless there's an application which handles
> tuples containing a t_ctid link.

What about READ COMMITTED mode? EvalPlanQual uses the t_ctid field
to find the updated version of the row. If the t_ctid is wrong,
you might get an elog(), or you might miss the row you should have
updated, or possibly you might update a row that you shouldn't have.
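
For illustration, the chain-following looks roughly like this (a sketch
only; fetch_tuple_at() is a made-up stand-in for the buffer-manager
calls, not the actual executor code):

/* Starting from the version our snapshot saw, follow t_ctid to the
 * latest version of the row before re-evaluating the quals. */
tid = tuple.t_data->t_ctid;
for (;;)
{
    fetch_tuple_at(rel, &tid, &tuple);     /* hypothetical fetch-by-TID helper */
    if (ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid))
        break;                             /* t_ctid points at itself: latest version */
    tid = tuple.t_data->t_ctid;            /* follow the forward link */
}
/* If any t_ctid in this chain is stale, we end up re-checking (and
 * possibly updating) the wrong tuple, or elog()ing when the link leads
 * nowhere sensible. */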

regards, tom lane

#9 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#4)
Re: Possible bug in vacuum redo

Tom Lane wrote:

> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>
>> I don't think it's preferable either. However, there's
>> no problem unless there's an application which handles
>> tuples containing a t_ctid link.
>
> What about READ COMMITTED mode? EvalPlanQual uses the t_ctid field
> to find the updated version of the row.

In READ COMMITTED mode, an app searches for valid tuples first, using
the snapshot taken when the query started. It never looks at tuples
that were already updated (to newer versions) and committed as of the
point when the query started. Essentially, t_ctid is only needed by
concurrently running backends.
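
(In visibility-check terms, roughly; a sketch only, with a made-up
helper name, not the actual tqual.c code:)

/* A row version whose updater committed before our snapshot was taken
 * is simply invisible to this query, so its t_ctid is never followed. */
if (XidCommittedBeforeSnapshot(tup->t_xmax, snapshot))
    return false;          /* superseded version: filtered out by the snapshot */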

regards,
Hiroshi Inoue

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#9)
Re: Possible bug in vacuum redo

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

> In READ COMMITTED mode, an app searches for valid tuples first, using
> the snapshot taken when the query started. It never looks at tuples
> that were already updated (to newer versions) and committed as of the
> point when the query started. Essentially, t_ctid is only needed by
> concurrently running backends.

[ thinks for awhile ] I see: you're saying that t_ctid is only
used by transactions that are concurrent with the deleting transaction,
so if there's a database crash there's no need to restore t_ctid.

Probably true, but still mighty ugly.

Meanwhile, I guess I gotta look elsewhere for a theory to explain
those reports of duplicate rows. Oh well...

regards, tom lane

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#6)
Re: Possible bug in vacuum redo

Hannu Krosing <hannu@tm.ee> writes:

> FWIW, I got the following deadlock-like situation by simply running
> pgbench:

Using what version? That looks like the deadlock I fixed last week.

regards, tom lane

#12 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#4)
Re: Possible bug in vacuum redo

Tom Lane wrote:

> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>
>> In READ COMMITTED mode, an app searches for valid tuples first, using
>> the snapshot taken when the query started. It never looks at tuples
>> that were already updated (to newer versions) and committed as of the
>> point when the query started. Essentially, t_ctid is only needed by
>> concurrently running backends.
>
> [ thinks for awhile ] I see: you're saying that t_ctid is only
> used by transactions that are concurrent with the deleting transaction,
> so if there's a database crash there's no need to restore t_ctid.

Yes.

> Probably true, but still mighty ugly.

Yes.

> Meanwhile, I guess I gotta look elsewhere for a theory to explain
> those reports of duplicate rows. Oh well...

Great. Where is it?

regards,
Hiroshi Inoue

#13 Hannu Krosing
hannu@tm.ee
In reply to: Hiroshi Inoue (#4)
Re: Possible bug in vacuum redo

Tom Lane wrote:

> Hannu Krosing <hannu@tm.ee> writes:
>
>> FWIW, I got the following deadlock-like situation by simply running
>> pgbench:
>
> Using what version? That looks like the deadlock I fixed last week.

Official 7.2b4. Should I get a fresh one from CVS, or is beta5/RC1 imminent?

--------------------
Hannu

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#13)
Re: Possible bug in vacuum redo

Hannu Krosing <hannu@tm.ee> writes:

>> Using what version? That looks like the deadlock I fixed last week.
>
> Official 7.2b4. Should I get a fresh one from CVS, or is beta5/RC1 imminent?

Use CVS (or a nightly snapshot if that's easier). I doubt we'll do
anything until after Christmas...

regards, tom lane

#15 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#10)
Re: Possible bug in vacuum redo

Tom Lane wrote:

> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>
>> In READ COMMITTED mode, an app searches for valid tuples first, using
>> the snapshot taken when the query started. It never looks at tuples
>> that were already updated (to newer versions) and committed as of the
>> point when the query started. Essentially, t_ctid is only needed by
>> concurrently running backends.
>
> [ thinks for awhile ] I see: you're saying that t_ctid is only
> used by transactions that are concurrent with the deleting transaction,
> so if there's a database crash there's no need to restore t_ctid.
>
> Probably true, but still mighty ugly.
>
> Meanwhile, I guess I gotta look elsewhere for a theory to explain
> those reports of duplicate rows. Oh well...

Can someone document this in the sources somewhere? I am not sure how
to do it.
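
Maybe something along these lines, next to heap_xlog_update say (wording
is only a suggestion):

/*
 * Note on t_ctid vs. crash recovery: t_ctid is not carried in the WAL
 * records for heap updates or for VACUUM's tuple moves, so after redo
 * the forward links in update chains may be stale.  This is considered
 * acceptable because t_ctid is only followed by backends running
 * concurrently with the updating transaction (e.g. READ COMMITTED
 * re-checks via EvalPlanQual); recovery runs with no other backends,
 * and backends started afterwards never need to chase links for
 * updates that committed before their snapshots were taken.
 */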

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026