pgsql: Fix freezing of a dead HOT-updated tuple

Started by Alvaro Herrera over 8 years ago · 124 messages · hackers
#1 Alvaro Herrera
alvherre@2ndquadrant.com

Fix freezing of a dead HOT-updated tuple

Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples. But
concurrent transaction commit/abort may turn DEAD some of the HOT tuples
that survived the prune, before HeapTupleSatisfiesVacuum tests them.
This happens to activate the code that decides to freeze the tuple ...
which resuscitates it, duplicating data.

(This is especially bad if there are any unique constraints, because those
are now internally violated due to the duplicate entries, though you
won't know until you try to REINDEX or dump/restore the table.)

One possible fix would be to simply skip doing anything to the tuple,
and hope that the next HOT prune would remove it. But there is a
problem: if the tuple is older than freeze horizon, this would leave an
unfrozen XID behind, and if no HOT prune happens to clean it up before
the containing pg_clog segment is truncated away, it'd later cause an
error when the XID is looked up.

Fix the problem by having the tuple freezing routines cope with the
situation: don't freeze the tuple (and keep it dead). In the cases that
the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
so that there is no need to look up the XID in pg_clog later on.
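The decision rule described in this paragraph can be sketched as a toy model (Python, for illustration only; the function and constant names are invented, and this is not the actual heapam.c logic):

```python
# Toy model of the freezing rule described above -- not PostgreSQL source.
# The real logic lives in the tuple-freezing routines in heapam.c.

HEAP_XMAX_COMMITTED = 0x0400   # t_infomask bit, value as in htup_details.h

def plan_xmax_freeze(xmax, cutoff_xid, tuple_is_dead):
    """Decide what to do with a tuple's xmax during freezing.

    Returns (new_xmax, infomask_bits_to_set); 0 stands in for
    InvalidTransactionId.
    """
    if tuple_is_dead:
        # A dead tuple's xmax must never be cleared: that would
        # "resuscitate" the tuple and duplicate the row.
        if xmax < cutoff_xid:
            # Too old to look up in pg_clog later, so remember the
            # commit status in the tuple header instead.
            return xmax, HEAP_XMAX_COMMITTED
        return xmax, 0
    # Ordinary (live) tuple: an xmax older than the cutoff is cleared.
    if xmax < cutoff_xid:
        return 0, 0
    return xmax, 0
```

The key asymmetry is visible in the first branch: a dead tuple keeps its xmax no matter how old it is, and only gains a hint bit.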

An isolation test is included, authored by Michael Paquier, loosely
based on Daniel Wood's original reproducer. It only tests one
particular scenario, though, not all the possible ways for this problem
to surface; it'd be good to have a more reliable way to test this more
fully, but it'd require more work.
In message /messages/by-id/20170911140103.5akxptyrwgpc25bw@alvherre.pgsql
I outlined another test case (more closely matching Dan Wood's) that
exposed a few more ways for the problem to occur.

Backpatch all the way back to 9.3, where this problem was introduced by
multixact juggling. In branches 9.3 and 9.4, this includes a backpatch
of commit e5ff9fefcd50 (of 9.5 era), since the original is not
correctable without matching the coding pattern in 9.5 up.

Reported-by: Daniel Wood
Diagnosed-by: Daniel Wood
Reviewed-by: Yi Wen Wong, Michaël Paquier
Discussion: /messages/by-id/E5711E62-8FDF-4DCA-A888-C200BF6B5742@amazon.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/20b655224249e6d2daf7ef0595995228baddb381

Modified Files
--------------
src/backend/access/heap/heapam.c | 57 +++++++++----
src/backend/commands/vacuumlazy.c | 20 ++---
src/test/isolation/expected/freeze-the-dead.out | 101 ++++++++++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
src/test/isolation/specs/freeze-the-dead.spec | 27 +++++++
5 files changed, 179 insertions(+), 27 deletions(-)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#2 Peter Geoghegan
pg@bowt.ie
In reply to: Alvaro Herrera (#1)
Re: pgsql: Fix freezing of a dead HOT-updated tuple

On Thu, Sep 28, 2017 at 7:47 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Fix freezing of a dead HOT-updated tuple

If I run Dan Wood's test case again, the obvious symptom (spurious
duplicates) goes away. However, the enhanced amcheck, and thus CREATE
INDEX/REINDEX, still isn't happy about this:

postgres=# select bt_index_check('t_pkey', true);
DEBUG: 00000: verifying presence of required tuples in index "t_pkey"
LOCATION: bt_check_every_level, verify_nbtree.c:424
ERROR: XX000: failed to find parent tuple for heap-only tuple at
(0,6) in table "t"
LOCATION: IndexBuildHeapRangeScan, index.c:2597
Time: 3.699 ms

--
Peter Geoghegan

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#2)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

Peter Geoghegan wrote:

On Thu, Sep 28, 2017 at 7:47 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Fix freezing of a dead HOT-updated tuple

If I run Dan Wood's test case again, the obvious symptom (spurious
duplicates) goes away. However, the enhanced amcheck, and thus CREATE
INDEX/REINDEX, still isn't happy about this:

postgres=# select bt_index_check('t_pkey', true);
DEBUG: 00000: verifying presence of required tuples in index "t_pkey"
LOCATION: bt_check_every_level, verify_nbtree.c:424
ERROR: XX000: failed to find parent tuple for heap-only tuple at
(0,6) in table "t"
LOCATION: IndexBuildHeapRangeScan, index.c:2597
Time: 3.699 ms

... Rats, I obviously missed the message where you said that amcheck
detected this problem.

Odd that it's not fixed. I guess there's still some more work to do
here ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#3)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

Alvaro Herrera wrote:

Odd that it's not fixed. I guess there's still some more work to do
here ...

Maybe what this means is that we need to do both Dan's initially
proposed patch (or something related to it) apart from the fixes already
pushed. IOW we need to put back some of the "tupkeep" business ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5 Peter Geoghegan
pg@bowt.ie
In reply to: Alvaro Herrera (#4)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Alvaro Herrera wrote:

Odd that it's not fixed. I guess there's still some more work to do
here ...

Maybe what this means is that we need to do both Dan's initially
proposed patch (or something related to it) apart from the fixes already
pushed. IOW we need to put back some of the "tupkeep" business ...

We certainly do still see wrong answers to queries here:

postgres=# select ctid, xmin, xmax, * from t;
ctid | xmin | xmax | id | name | x
-------+-------+------+----+------+---
(0,1) | 21171 | 0 | 1 | 111 | 0
(0,7) | 21177 | 0 | 3 | 333 | 5
(2 rows)

postgres=# select * from t where id = 3;
id | name | x
----+------+---
3 | 333 | 5
(1 row)

postgres=# set enable_seqscan = off;
SET
postgres=# select * from t where id = 3;
id | name | x
----+------+---
(0 rows)

FWIW, I am reminded a little bit of the MultiXact/recovery bug I
reported way back in February of 2014 [1], which also had a HOT
interaction that caused index scans to give wrong answers, despite
more or less structurally sound indexes.

[1]: /messages/by-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com

--
Peter Geoghegan

#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#5)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

Peter Geoghegan wrote:

We certainly do still see wrong answers to queries here:

postgres=# select ctid, xmin, xmax, * from t;
ctid | xmin | xmax | id | name | x
-------+-------+------+----+------+---
(0,1) | 21171 | 0 | 1 | 111 | 0
(0,7) | 21177 | 0 | 3 | 333 | 5
(2 rows)

postgres=# select * from t where id = 3;
id | name | x
----+------+---
3 | 333 | 5
(1 row)

postgres=# set enable_seqscan = off;
SET
postgres=# select * from t where id = 3;
id | name | x
----+------+---
(0 rows)

Yeah, oops.

FWIW, I am reminded a little bit of the MultiXact/recovery bug I
reported way back in February of 2014 [1], which also had a HOT
interaction that caused index scans to give wrong answers, despite
more or less structurally sound indexes.

[1] /messages/by-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com

Thanks for the reference. I didn't remember this problem and it's not
(wasn't) in my list of things to look into. Perhaps these are both the
same bug.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 Peter Geoghegan
pg@bowt.ie
In reply to: Alvaro Herrera (#6)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Thu, Sep 28, 2017 at 2:39 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

FWIW, I am reminded a little bit of the MultiXact/recovery bug I
reported way back in February of 2014 [1], which also had a HOT
interaction that caused index scans to give wrong answers, despite
more or less structurally sound indexes.

[1] /messages/by-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com

Thanks for the reference. I didn't remember this problem and it's not
(wasn't) in my list of things to look into. Perhaps these are both the
same bug.

I was reminded of that old bug because initially, at the time, it
looked very much like a corrupt index: sequential scans were fine, but
index scans gave wrong answers. This is what I saw today.

In the end, commit 6bfa88a fixed that old recovery bug by making sure
the recovery routine heap_xlog_lock() did the right thing. In both
cases (Feb 2014 and today), the index wasn't really corrupt -- it just
pointed to the root of a HOT chain when it should point to some child
tuple (or maybe a successor HOT chain).

--
Peter Geoghegan

#8 Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#7)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Thu, Sep 28, 2017 at 2:47 PM, Peter Geoghegan <pg@bowt.ie> wrote:

In the end, commit 6bfa88a fixed that old recovery bug by making sure
the recovery routine heap_xlog_lock() did the right thing. In both
cases (Feb 2014 and today), the index wasn't really corrupt -- it just
pointed to the root of a HOT chain when it should point to some child
tuple (or maybe a successor HOT chain).

Just to be clear, obviously I don't mean that the index should have
been magically updated to compensate for that 2014 heap_lock_tuple()
recovery bug (say, via an in-place IndexTuple update during recovery).
Certainly, the index AM was entitled to assume that the heap TID it
pointed to would continue to be the root of the same HOT chain, at
least until the next VACUUM. That was what I meant by "the index
wasn't actually corrupt".

All I meant here is that the index scan and sequential scan would have
been in agreement if something like that *had* happened artificially
(say, when the index tuple was edited with a hex editor). And, that
the actual high level problem was that we failed to treat a
HOT-updated tuple as a HOT-updated tuple, leading to consequences that
were mostly noticeable during index scans. Probably on both occasions
(back in 2014, and now).

--
Peter Geoghegan

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#7)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Thu, Sep 28, 2017 at 5:47 PM, Peter Geoghegan <pg@bowt.ie> wrote:

In the end, commit 6bfa88a fixed that old recovery bug by making sure
the recovery routine heap_xlog_lock() did the right thing. In both
cases (Feb 2014 and today), the index wasn't really corrupt -- it just
pointed to the root of a HOT chain when it should point to some child
tuple (or maybe a successor HOT chain).

Unless I'm very confused, it's really not OK to point at a child tuple
rather than the root of the HOT chain.

Pointing to a successor HOT chain would be OK, as long as you point to
the root tuple thereof.
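Robert's rule can be stated as a small predicate (a toy Python model with invented structures, not PostgreSQL code): an index TID may land on a redirect line pointer or on the root of a HOT chain, never on a heap-only child.

```python
# Toy model, not PostgreSQL source: a heap page as a dict mapping
# offset number -> item descriptor.

def is_legal_index_target(page, offnum):
    """True iff an index tuple may point at this line pointer."""
    item = page.get(offnum)
    if item is None:
        return False
    if item["kind"] == "redirect":
        return True                     # redirect to the chain's live part
    # A normal tuple is a legal target only if it is a chain root,
    # i.e. not marked heap-only.
    return item["kind"] == "normal" and not item["heap_only"]

# Example page: lp 1 is an ordinary root, lp 2 redirects to lp 7,
# lp 7 is the heap-only surviving version of a chain.
page = {
    1: {"kind": "normal", "heap_only": False},
    2: {"kind": "redirect", "target": 7},
    7: {"kind": "normal", "heap_only": True},
}
```

Under this rule, pointing at lp 2 (the redirect) is fine, while pointing directly at the heap-only lp 7 is not.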

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10 Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#9)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Thu, Sep 28, 2017 at 3:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 28, 2017 at 5:47 PM, Peter Geoghegan <pg@bowt.ie> wrote:

In the end, commit 6bfa88a fixed that old recovery bug by making sure
the recovery routine heap_xlog_lock() did the right thing. In both
cases (Feb 2014 and today), the index wasn't really corrupt -- it just
pointed to the root of a HOT chain when it should point to some child
tuple (or maybe a successor HOT chain).

Unless I'm very confused, it's really not OK to point at a child tuple
rather than the root of the HOT chain.

Please see my later clarification.

--
Peter Geoghegan

#11 Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#6)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Fri, Sep 29, 2017 at 6:39 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Peter Geoghegan wrote:

We certainly do still see wrong answers to queries here:

postgres=# select ctid, xmin, xmax, * from t;
ctid | xmin | xmax | id | name | x
-------+-------+------+----+------+---
(0,1) | 21171 | 0 | 1 | 111 | 0
(0,7) | 21177 | 0 | 3 | 333 | 5
(2 rows)

postgres=# select * from t where id = 3;
id | name | x
----+------+---
3 | 333 | 5
(1 row)

postgres=# set enable_seqscan = off;
SET
postgres=# select * from t where id = 3;
id | name | x
----+------+---
(0 rows)

Yeah, oops.

This really looks like a heap-level problem, with the parent
redirection not getting defined (?). Please note that a subsequent
REINDEX fails as well:
=# reindex index t_pkey ;
ERROR: XX000: failed to find parent tuple for heap-only tuple at
(0,7) in table "t"
LOCATION: IndexBuildHeapRangeScan, index.c:2597

VACUUM FREEZE also is not getting things right, but a VACUUM FULL does.

Also, dropping the constraint and attempting to recreate it is failing:
=# alter table t drop constraint t_pkey;
ALTER TABLE
=# create index t_pkey on t(id);
ERROR: XX000: failed to find parent tuple for heap-only tuple at
(0,7) in table "t"
LOCATION: IndexBuildHeapRangeScan, index.c:2597

A corrupted page clearly indicates that there are no tuple redirections:
=# select lp, t_ctid, lp_off, t_infomask, t_infomask2 from
heap_page_items(get_raw_page('t', 0));
lp | t_ctid | lp_off | t_infomask | t_infomask2
----+--------+--------+------------+-------------
1 | (0,1) | 8152 | 2818 | 3
2 | null | 0 | null | null
3 | (0,4) | 8112 | 9986 | 49155
4 | (0,5) | 8072 | 9986 | 49155
5 | (0,6) | 8032 | 9986 | 49155
6 | (0,7) | 7992 | 9986 | 49155
7 | (0,7) | 7952 | 11010 | 32771
(7 rows)

And a non-corrupted page clearly shows the redirection done with lp_off:
lp | t_ctid | lp_off | t_infomask | t_infomask2
----+--------+--------+------------+-------------
1 | (0,1) | 8152 | 2818 | 3
2 | null | 7 | null | null
3 | null | 0 | null | null
4 | null | 0 | null | null
5 | null | 0 | null | null
6 | null | 0 | null | null
7 | (0,7) | 8112 | 11010 | 32771
(7 rows)
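The redirection Michael points at works because, for a redirect line pointer, lp_off stores the offset number of the redirect target rather than a byte offset (which is why lp 2 carries lp_off = 7 in the healthy page, and 0 once the redirect is lost). A minimal decoder sketch (invented helper, not pageinspect itself):

```python
# Sketch of interpreting line pointers in the dumps above -- invented
# helper, not PostgreSQL code.  lp_flags values per storage/itemid.h.

LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD = 0, 1, 2, 3

def decode_item(lp_flags, lp_off):
    """Explain what lp_off means for a given line-pointer state."""
    if lp_flags == LP_REDIRECT:
        # For redirects, lp_off is overloaded to hold the target
        # offset number within the same page.
        return ("redirect to offset", lp_off)
    if lp_flags == LP_NORMAL:
        return ("tuple at byte", lp_off)
    if lp_flags == LP_DEAD:
        return ("dead", None)
    return ("unused", None)
```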
--
Michael

#12 Peter Geoghegan
pg@bowt.ie
In reply to: Alvaro Herrera (#4)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Maybe what this means is that we need to do both Dan's initially
proposed patch (or something related to it) apart from the fixes already
pushed. IOW we need to put back some of the "tupkeep" business ...

I took the time to specifically check if that would fix the problem.
Unfortunately, it did not. We see exactly the same problem, or at
least amcheck/REINDEX produces exactly the same error. I checked both
Dan's original update_freeze.patch, and your revision that retained
some of the "tupkeep" stuff,
0002-Don-t-freeze-recently-dead-HOT-tuples, which you posted on
September 6th.

--
Peter Geoghegan

#13 Michael Paquier
michael@paquier.xyz
In reply to: Peter Geoghegan (#12)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Sun, Oct 1, 2017 at 10:25 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Maybe what this means is that we need to do both Dan's initially
proposed patch (or something related to it) apart from the fixes already
pushed. IOW we need to put back some of the "tupkeep" business ...

I took the time to specifically check if that would fix the problem.
Unfortunately, it did not. We see exactly the same problem, or at
least amcheck/REINDEX produces exactly the same error. I checked both
Dan's original update_freeze.patch, and your revision that retained
some of the "tupkeep" stuff,
0002-Don-t-freeze-recently-dead-HOT-tuples, which you posted on
September 6th.

I did not take the time to dig into that more than two hours, but my
first feeling is that some race condition is going on with the heap
pruning.
--
Michael

#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#13)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

Michael Paquier wrote:

On Sun, Oct 1, 2017 at 10:25 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Maybe what this means is that we need to do both Dan's initially
proposed patch (or something related to it) apart from the fixes already
pushed. IOW we need to put back some of the "tupkeep" business ...

I took the time to specifically check if that would fix the problem.
Unfortunately, it did not. We see exactly the same problem, or at
least amcheck/REINDEX produces exactly the same error. I checked both
Dan's original update_freeze.patch, and your revision that retained
some of the "tupkeep" stuff,
0002-Don-t-freeze-recently-dead-HOT-tuples, which you posted on
September 6th.

I did not take the time to dig into that more than two hours, but my
first feeling is that some race condition is going on with the heap
pruning.

I'll look into this on Monday. I found out yesterday that the
problematic case is when HTSV returns HEAPTUPLE_DEAD and the HOT tests
return true. I haven't yet figured out if it is one of those specifically,
but I suspect it is the IsHotUpdated case.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#2)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

So here's my attempt at an explanation for what is going on. At one
point, we have this:

select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+----------+--------+--------+--------+----------+-----------
1 | 1 | 2 | 0 | (0,1) | 902 | 3
2 | 0 | | | | |
3 | 1 | 2 | 19928 | (0,4) | 3142 | c003
4 | 1 | 14662 | 19929 | (0,5) | 3142 | c003
5 | 1 | 14663 | 19931 | (0,6) | 3142 | c003
6 | 1 | 14664 | 19933 | (0,7) | 3142 | c003
7 | 1 | 14665 | 0 | (0,7) | 2902 | 8003
(7 rows)

which shows a HOT-update chain, where the t_xmax are multixacts. Then a
vacuum freeze comes, and because the multixacts are below the freeze
horizon for multixacts, we get this:

select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+----------+--------+--------+--------+----------+-----------
1 | 1 | 2 | 0 | (0,1) | 902 | 3
2 | 0 | | | | |
3 | 1 | 2 | 14662 | (0,4) | 2502 | c003
4 | 1 | 2 | 14663 | (0,5) | 2502 | c003
5 | 1 | 2 | 14664 | (0,6) | 2502 | c003
6 | 1 | 2 | 14665 | (0,7) | 2502 | c003
7 | 1 | 2 | 0 | (0,7) | 2902 | 8003
(7 rows)

where the xmin values have all been frozen, and the xmax values are now
regular Xids. I think the HOT code that walks the chain fails to detect
these as chains, because the xmin values no longer match the xmax
values. I modified the multixact freeze code, so that whenever the
update Xid is below the cutoff Xid, it's set to FrozenTransactionId,
since keeping the other value is invalid anyway (even though we have set
the HEAP_XMAX_COMMITTED flag). But that still doesn't fix the problem;
as far as I can see, vacuum removes the root of the chain, not yet sure
why, and then things are just as corrupted as before.
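The chain-walk failure described above can be reproduced in a toy model (Python, invented structures; the real check lives in heap_prune_chain in pruneheap.c). Here "xmax" stands in for the update XID the pruning code extracts from the tuple; once freezing replaces every xmin with FrozenTransactionId (2), the xmin-to-priorXmax match fails at the first hop and the walk stops at the root:

```python
# Toy model of the HOT chain walk -- not PostgreSQL source.

FROZEN_XID = 2

def walk_hot_chain(tuples, root):
    """Follow t_ctid links, stopping (as the pruning code does) when a
    child's xmin does not match the parent's update xid."""
    visited = [root]
    prior_xmax = tuples[root]["xmax"]
    nxt = tuples[root]["ctid"]
    while nxt != visited[-1]:           # a self-pointing ctid ends the chain
        if tuples[nxt]["xmin"] != prior_xmax:
            break                       # chain considered broken here
        visited.append(nxt)
        prior_xmax = tuples[nxt]["xmax"]
        nxt = tuples[nxt]["ctid"]
    return visited

# Chain shaped like the page dumps above: offsets 3..7, each child's
# xmin matching its parent's update xid.
before_freeze = {
    3: {"xmin": 2,     "xmax": 14662, "ctid": 4},
    4: {"xmin": 14662, "xmax": 14663, "ctid": 5},
    5: {"xmin": 14663, "xmax": 14664, "ctid": 6},
    6: {"xmin": 14664, "xmax": 14665, "ctid": 7},
    7: {"xmin": 14665, "xmax": 0,     "ctid": 7},
}
# Freezing rewrites every xmin to FrozenTransactionId.
after_freeze = {off: dict(t, xmin=FROZEN_XID)
                for off, t in before_freeze.items()}
```

Before freezing the walk visits the whole chain; afterwards it sees only the root, which is consistent with the chain no longer being detected.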

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16 Peter Geoghegan
pg@bowt.ie
In reply to: Alvaro Herrera (#15)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Tue, Oct 3, 2017 at 9:48 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

which shows a HOT-update chain, where the t_xmax are multixacts. Then a
vacuum freeze comes, and because the multixacts are below the freeze
horizon for multixacts, we get this:

select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+----------+--------+--------+--------+----------+-----------
1 | 1 | 2 | 0 | (0,1) | 902 | 3
2 | 0 | | | | |
3 | 1 | 2 | 14662 | (0,4) | 2502 | c003
4 | 1 | 2 | 14663 | (0,5) | 2502 | c003
5 | 1 | 2 | 14664 | (0,6) | 2502 | c003
6 | 1 | 2 | 14665 | (0,7) | 2502 | c003
7 | 1 | 2 | 0 | (0,7) | 2902 | 8003
(7 rows)

where the xmin values have all been frozen, and the xmax values are now
regular Xids.

I thought that we no longer store FrozenTransactionId (xid 2) as our
"raw" xmin while freezing, and yet that's what we see here. It looks
like pageinspect is looking at raw xmin (it's calling
HeapTupleHeaderGetRawXmin(), not HeapTupleHeaderGetXmin()). What's the
deal with that? Shouldn't the original xmin be preserved for forensic
analysis, following commit 37484ad2?

I guess what you mean is that this is what you see having modified the
code to actually store FrozenTransactionId as xmin once more, in an
effort to fix this?

--
Peter Geoghegan

#17 Peter Geoghegan
pg@bowt.ie
In reply to: Alvaro Herrera (#15)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Tue, Oct 3, 2017 at 9:48 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

But that still doesn't fix the problem;
as far as I can see, vacuum removes the root of the chain, not yet sure
why, and then things are just as corrupted as before.

Are you sure it's not opportunistic pruning? Another thing that I've
noticed with this problem is that the relevant IndexTuple will pretty
quickly vanish, presumably due to LP_DEAD setting (but maybe not
actually due to LP_DEAD setting).

(Studies the problem some more...)

I now think that it actually is a VACUUM problem, specifically a
problem with VACUUM pruning. You see the HOT xmin-to-xmax check
pattern that you mentioned within heap_prune_chain(), which looks like
where the incorrect tuple prune (or possibly, at times, redirect?)
takes place. (I refer to the prune/kill that you mentioned today, that
frustrated your first attempt at a fix -- "I modified the multixact
freeze code...".)

The attached patch "fixes" the problem -- I cannot get amcheck to
complain about corruption with this applied. And, "make check-world"
passes. Hopefully it goes without saying that this isn't actually my
proposed fix. It tells us something that this at least *masks* the
problem, though; it's a start.

FYI, the repro case page contents looks like this with the patch applied:

postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid,
to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+----------+---------+--------+--------+----------+-----------
1 | 1 | 1845995 | 0 | (0,1) | b02 | 3
2 | 2 | | | | |
3 | 0 | | | | |
4 | 0 | | | | |
5 | 0 | | | | |
6 | 0 | | | | |
7 | 1 | 1846001 | 0 | (0,7) | 2b02 | 8003
(7 rows)

--
Peter Geoghegan

Attachments:

suppress-bad-prune.patch (text/x-patch, +0/-7)

#18 Michael Paquier
michael@paquier.xyz
In reply to: Peter Geoghegan (#17)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Wed, Oct 4, 2017 at 8:10 AM, Peter Geoghegan <pg@bowt.ie> wrote:

I now think that it actually is a VACUUM problem, specifically a
problem with VACUUM pruning. You see the HOT xmin-to-xmax check
pattern that you mentioned within heap_prune_chain(), which looks like
where the incorrect tuple prune (or possibly, at times, redirect?)
takes place. (I refer to the prune/kill that you mentioned today, that
frustrated your first attempt at a fix -- "I modified the multixact
freeze code...".)

My investigation of the problem converges on the same conclusion. Something
is wrong with the vacuum's pruning. I have spent some time trying to
design a patch; all the solutions I tried have proved to make the
problem harder to show up, but it still showed up, sometimes after
dozens of attempts.

The attached patch "fixes" the problem -- I cannot get amcheck to
complain about corruption with this applied. And, "make check-world"
passes. Hopefully it goes without saying that this isn't actually my
proposed fix. It tells us something that this at least *masks* the
problem, though; it's a start.

Yep.

FYI, the repro case page contents looks like this with the patch applied:
postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid,
to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+----------+---------+--------+--------+----------+-----------
1 | 1 | 1845995 | 0 | (0,1) | b02 | 3
2 | 2 | | | | |
3 | 0 | | | | |
4 | 0 | | | | |
5 | 0 | | | | |
6 | 0 | | | | |
7 | 1 | 1846001 | 0 | (0,7) | 2b02 | 8003
(7 rows)

Is lp_off for tid (0,2) pointing to (0,7)? A hot chain preserved is
what would look correct to me.

- * Check the tuple XMIN against prior XMAX, if any
- */
- if (TransactionIdIsValid(priorXmax) &&
- !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
- break;
If you remove this check, you could also remove priorXmax completely.

Actually, I may be missing something, but why is priorXmax updated
even for dead tuples? For example just doing that is also taking care
of the problem:
@@ -558,7 +558,8 @@ heap_prune_chain(Relation relation, Buffer buffer,
OffsetNumber rootoffnum,
        Assert(ItemPointerGetBlockNumber(&htup->t_ctid) ==
               BufferGetBlockNumber(buffer));
        offnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
-       priorXmax = HeapTupleHeaderGetUpdateXid(htup);
+       if (!tupdead)
+           priorXmax = HeapTupleHeaderGetUpdateXid(htup);
    }
-- 
Michael

#19 Wood, Dan
hexpert@amazon.com
In reply to: Michael Paquier (#18)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

I’ve just started looking at this again after a few weeks break.

There is a tangled web of issues here. With the community fix we get a corrupted page (invalid redirect ptr from indexed item). The cause of that is:
pruneheap.c:

/*
* Check the tuple XMIN against prior XMAX, if any
*/
if (TransactionIdIsValid(priorXmax) &&
!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
break;

chainitems[nchain++] = offnum;

The priorXmax is a multixact key share lock, thus not equal to xmin. This terminates the scan. Unlike the scan termination with “if (!recent_dead) break;” below the switch (...SatisfiesVacuum…), this “break” does not put the offnum into the chain even though it is in the chain. If the first not-deleted item isn’t put in the chain then we’ll not call heap_prune_record_redirect().

I do not know what the above ‘if’ test is protecting against. Normally the xmin is equal to the priorXmax; other than corruption, a key share lock can cause them to differ. So, I tried a fix which does the “if” check after adding the item to chainitems. This will break whatever real situation this IF (which Tom Lane put in) was protecting against.

OK: With that hack of a fix the redirect now works correctly. However, we still get the reindex problem with not finding the parent. That problem is caused by:
Pruneheap.c:heap_get_root_tuples()

if (TransactionIdIsValid(priorXmax) &&
!TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(htup)))
break;

In this case, instead of these not being equal because of a multixact key share lock, it is because XMIN is frozen and FrozenTransactionId doesn’t equal the priorXmax. Thus, we don’t build the entire chain from root to most current row version and this causes the reindex failure.

If we disable this ‘if’ as a hack then we no longer get a problem on the reindex. However, YiWen reported that at the end of an installcheck run, index checking reported corruption in the system catalogues. So we are still looking.

We need to understand why these TXID equal checks exist. Can we differentiate the cases they are protecting against with the two exceptions I’ve found?

FYI, someone should look at the same ”if” test in heapam.c: heap_lock_updated_tuple_rec(). Also, I hope there are no strange issues with concurrent index builds.

Finally, the idea behind the original fix was to simply NOT to do an unsupported freeze on a dead tuple. It had two drawbacks:
1) CLOG truncation. This could have been handled by keeping track of the old unpruned item found and using that to update the table’s/DB’s freeze xid.
2) Not making freeze progress. The only reason the prune would fail should be because of an open TXN. Unless that TXN was so old such that it’s XID was as old as the ?min freeze threshold? then we would make progress. If we were doing TXN’s that old then we’d be having problems anyway.
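The ordering problem Dan describes in heap_prune_chain (the break firing before the offnum is recorded in chainitems, so the redirect target is lost) can be sketched as a toy model (Python, invented names; not the pruneheap.c code):

```python
# Toy model of "check before record" vs "record before check" when
# collecting a HOT chain -- not PostgreSQL source.

def collect_chain(tuples, root, check_before_record):
    """Collect chain member offsets, stopping on an xmin/priorXmax
    mismatch either before or after recording the current member."""
    chain = []
    prior_xmax = None
    off = root
    while off is not None:
        t = tuples[off]
        mismatch = prior_xmax is not None and t["xmin"] != prior_xmax
        if check_before_record and mismatch:
            break                   # offnum never recorded (the bug Dan hit)
        chain.append(off)
        if not check_before_record and mismatch:
            break                   # recorded first, then stop
        prior_xmax = t["xmax"]
        off = t.get("next")
    return chain

# Parent's "update xid" (100) does not match the child's xmin (999),
# e.g. because the xmax is really a key-share multixact.
tuples = {
    3: {"xmin": 2,   "xmax": 100, "next": 4},
    4: {"xmin": 999, "xmax": 0},
}
```

With check-before-record the first live member is dropped from the chain, so no redirect can be installed for it; recording first keeps it. (This only illustrates the ordering; it says nothing about which real cases the check is protecting against.)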

On 10/3/17, 5:15 PM, "Michael Paquier" <michael.paquier@gmail.com> wrote:

On Wed, Oct 4, 2017 at 8:10 AM, Peter Geoghegan <pg@bowt.ie> wrote:

I now think that it actually is a VACUUM problem, specifically a
problem with VACUUM pruning. You see the HOT xmin-to-xmax check
pattern that you mentioned within heap_prune_chain(), which looks like
where the incorrect tuple prune (or possibly, at times, redirect?)
takes place. (I refer to the prune/kill that you mentioned today, that
frustrated your first attempt at a fix -- "I modified the multixact
freeze code...".)

My investigation of the problem converges to the same conclusion:
something is wrong with the vacuum's pruning. I have spent some time
trying to design a patch; all the solutions I tried just made the
problem harder to reproduce, but it still showed up, sometimes after
dozens of attempts.

The attached patch "fixes" the problem -- I cannot get amcheck to
complain about corruption with this applied. And, "make check-world"
passes. Hopefully it goes without saying that this isn't actually my
proposed fix. It tells us something that this at least *masks* the
problem, though; it's a start.

Yep.

FYI, the repro case's page contents look like this with the patch applied:
postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid,
to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+----------+---------+--------+--------+----------+-----------
1 | 1 | 1845995 | 0 | (0,1) | b02 | 3
2 | 2 | | | | |
3 | 0 | | | | |
4 | 0 | | | | |
5 | 0 | | | | |
6 | 0 | | | | |
7 | 1 | 1846001 | 0 | (0,7) | 2b02 | 8003
(7 rows)

Is lp_off for tid (0,2) pointing to (0,7)? A preserved HOT chain is
what would look correct to me.

- * Check the tuple XMIN against prior XMAX, if any
- */
- if (TransactionIdIsValid(priorXmax) &&
- !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
- break;
If you remove this check, you could also remove completely priorXmax.

    Actually, I may be missing something, but why is priorXmax updated
    even for dead tuples? For example, just doing this also takes care
    of the problem:

    @@ -558,7 +558,8 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
            Assert(ItemPointerGetBlockNumber(&htup->t_ctid) ==
                   BufferGetBlockNumber(buffer));
            offnum = ItemPointerGetOffsetNumber(&htup->t_ctid);
    -       priorXmax = HeapTupleHeaderGetUpdateXid(htup);
    +       if (!tupdead)
    +           priorXmax = HeapTupleHeaderGetUpdateXid(htup);
        }
    -- 
    Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Michael Paquier (#18)
Re: [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

On Tue, Oct 3, 2017 at 5:15 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

FYI, the repro case's page contents look like this with the patch applied:
postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid,
to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
----+----------+---------+--------+--------+----------+-----------
1 | 1 | 1845995 | 0 | (0,1) | b02 | 3
2 | 2 | | | | |
3 | 0 | | | | |
4 | 0 | | | | |
5 | 0 | | | | |
6 | 0 | | | | |
7 | 1 | 1846001 | 0 | (0,7) | 2b02 | 8003
(7 rows)

Is lp_off for tid (0,2) pointing to (0,7)? A preserved HOT chain is
what would look correct to me.

Yes, it is:

postgres=# select * from bt_page_items('foo', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 03 00 00 00 00 00 00 00
(2 rows)

I can tell from looking at my hex editor that the 4 bytes of ItemId
that we see for position '(0,2)' in the ItemId array are "07 00 01
00", meaning that '(0,2)' is an LP_REDIRECT item, repointing us to
'(0,7)'. Everything here looks sane to me, at least at first blush.

- * Check the tuple XMIN against prior XMAX, if any
- */
- if (TransactionIdIsValid(priorXmax) &&
- !TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
- break;
If you remove this check, you could also remove completely priorXmax.

Actually, I may be missing something, but why is priorXmax updated
even for dead tuples? For example just doing that is also taking care
of the problem:

I'll study what you suggest here some more tomorrow.

--
Peter Geoghegan

