"failed to find parent tuple for heap-only tuple" error as an ERRCODE_DATA_CORRUPTION ereport()

Started by Peter Geogheganabout 8 years ago7 messages
1 attachment(s)

Commit d70cf811, from 2014, promoted an Assert() within
IndexBuildHeapScan() to a "can't happen" elog() error, in order to
detect when a parent tuple cannot be found for some heap-only tuple --
if this happens, then it indicates corruption. I think that we should
make it a full ereport(), with an errcode of ERRCODE_DATA_CORRUPTION,
to match what Andres just added to code that deals with freezing (he
promoted Assert()s to errors, just like the 2014 commit, though he
went as far as making them ereport()s to begin with). Attached patch
does this.

I propose a backpatch to 9.3, partially for the sake of tools like
amcheck, where users may only be on the lookout for
ERRCODE_DATA_CORRUPTION and ERRCODE_INDEX_CORRUPTED.

FWIW, an old MultiXact/recovery bug, alluded to by the commit message
of d70cf811 [1]/messages/by-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com (and fixed by 6bfa88acd) was the cause of some déjà vu
for me while looking into the "freeze the dead" issues. Because the
enhanced amcheck [2]https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification actually raised this error when I went to verify
the first "freeze the dead" bugfix [3]https://postsgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com -- Peter Geoghegan, it's clearly effective as a
test for certain types of corruption. If CREATE
INDEX/IndexBuildHeapScan() didn't already perform this check, then it
would probably be necessary for amcheck to implement it on its own.
What heap_get_root_tuples() does for us here is ideally suited to
finding inconsistencies in HOT chains, because it matches xmin against
xmax, looks at line pointer bits/redirects, and consults pg_multixact
if necessary. The only thing that it *doesn't* do is make sure that
hint bits accurately reflect what it says in the CLOG -- we'll need to
find another way to do that, by directly targeting heap relations with
their own function. In short, it does an awful lot for tools like
amcheck, and I want to make sure that we get the full benefit of that.

[1]: /messages/by-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=O67w0cSswPvV7XfUcU5g@mail.gmail.com
[2]: https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification
[3]: https://postsgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

0001-Promote-HOT-parent-tuple-elog-to-an-ereport.patchtext/x-patch; charset=US-ASCII; name=0001-Promote-HOT-parent-tuple-elog-to-an-ereport.patchDownload
From 0a37ceea53736e39d0f1af72ec9bc07d98833370 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 15 Dec 2017 14:05:16 -0800
Subject: [PATCH] Promote "HOT parent tuple" elog to an ereport.

Commit d70cf811, from 2014, promoted an assert within
IndexBuildHeapScan() (and another similar assert) to a "can't happen"
elog() error, in order to detect when a parent tuple cannot be found for
some heap-only tuple during CREATE INDEX/REINDEX.  When the error is
raised, it indicates heap corruption.  This has proven useful as a
corruption smoke-test while investigating recent corruption-inducing
bugs in pruning/freezing.

This commit promotes those elog() errors to ereport() errors.  The
errors are emitted with ereport rather than elog, despite being "should
never happen" messages, so a proper error code is emitted.  To avoid
superfluous translations, mark messages as internal.

This is done primarily for the benefit of tools like amcheck, that may
want to piggy-back on IndexBuildHeapScan() to perform integrity
checking.  It's worth spelling out the nature of the problem for users
of these tools.

Author: Peter Geoghegan
Reviewed-By: Andres Freund
Backpatch: 9.3-
---
 src/backend/catalog/index.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0125c18..e43482d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2595,9 +2595,12 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_self);
 
 			if (!OffsetNumberIsValid(root_offsets[offnum - 1]))
-				elog(ERROR, "failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-					 ItemPointerGetBlockNumber(&heapTuple->t_self),
-					 offnum, RelationGetRelationName(heapRelation));
+				ereport(ERROR,
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
+										 ItemPointerGetBlockNumber(&heapTuple->t_self),
+										 offnum,
+										 RelationGetRelationName(heapRelation))));
 
 			ItemPointerSetOffsetNumber(&rootTuple.t_self,
 									   root_offsets[offnum - 1]);
@@ -3060,10 +3063,12 @@ validate_index_heapscan(Relation heapRelation,
 		{
 			root_offnum = root_offsets[root_offnum - 1];
 			if (!OffsetNumberIsValid(root_offnum))
-				elog(ERROR, "failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-					 ItemPointerGetBlockNumber(heapcursor),
-					 ItemPointerGetOffsetNumber(heapcursor),
-					 RelationGetRelationName(heapRelation));
+				ereport(ERROR,
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
+										 ItemPointerGetBlockNumber(heapcursor),
+										 ItemPointerGetOffsetNumber(heapcursor),
+										 RelationGetRelationName(heapRelation))));
 			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
 		}
 
-- 
2.7.4

#2David Steele
david@pgmasters.net
In reply to: Peter Geoghegan (#1)
Re: "failed to find parent tuple for heap-only tuple" error as an ERRCODE_DATA_CORRUPTION ereport()

On 12/15/17 5:31 PM, Peter Geoghegan wrote:

Commit d70cf811, from 2014, promoted an Assert() within
IndexBuildHeapScan() to a "can't happen" elog() error, in order to
detect when a parent tuple cannot be found for some heap-only tuple --
if this happens, then it indicates corruption. I think that we should
make it a full ereport(), with an errcode of ERRCODE_DATA_CORRUPTION,
to match what Andres just added to code that deals with freezing (he
promoted Assert()s to errors, just like the 2014 commit, though he
went as far as making them ereport()s to begin with). Attached patch
does this.

I propose a backpatch to 9.3, partially for the sake of tools like
amcheck, where users may only be on the lookout for
ERRCODE_DATA_CORRUPTION and ERRCODE_INDEX_CORRUPTED.

This patch applies, passes testing, and appears very straight-forward to
me. Are there any tests for these conditions currently, or are you only
doing that in amcheck?

--
-David
david@pgmasters.net

In reply to: David Steele (#2)
Re: "failed to find parent tuple for heap-only tuple" error as an ERRCODE_DATA_CORRUPTION ereport()

On Thu, Mar 1, 2018 at 11:15 AM, David Steele <david@pgmasters.net> wrote:

On 12/15/17 5:31 PM, Peter Geoghegan wrote:

Commit d70cf811, from 2014, promoted an Assert() within
IndexBuildHeapScan() to a "can't happen" elog() error, in order to
detect when a parent tuple cannot be found for some heap-only tuple --
if this happens, then it indicates corruption. I think that we should
make it a full ereport(), with an errcode of ERRCODE_DATA_CORRUPTION,
to match what Andres just added to code that deals with freezing (he
promoted Assert()s to errors, just like the 2014 commit, though he
went as far as making them ereport()s to begin with). Attached patch
does this.

I propose a backpatch to 9.3, partially for the sake of tools like
amcheck, where users may only be on the lookout for
ERRCODE_DATA_CORRUPTION and ERRCODE_INDEX_CORRUPTED.

This patch applies, passes testing, and appears very straight-forward to
me. Are there any tests for these conditions currently, or are you only
doing that in amcheck?

The enhanced amcheck in the current CF only tests these conditions
indirectly, by pretending to be a CREATE INDEX statement. It matters
to amcheck because these conditions are pretty plausible symptoms of
corruption, and I can imagine someone missing them because they are
not either ERRCODE_DATA_CORRUPTION or ERRCODE_INDEX_CORRUPTED.

If amcheck didn't exist, then I'd still think that this patch was worthwhile.

--
Peter Geoghegan

#4David Steele
david@pgmasters.net
In reply to: Peter Geoghegan (#3)
Re: "failed to find parent tuple for heap-only tuple" error as an ERRCODE_DATA_CORRUPTION ereport()

On 3/1/18 2:19 PM, Peter Geoghegan wrote:

On Thu, Mar 1, 2018 at 11:15 AM, David Steele <david@pgmasters.net> wrote:

On 12/15/17 5:31 PM, Peter Geoghegan wrote:

Commit d70cf811, from 2014, promoted an Assert() within
IndexBuildHeapScan() to a "can't happen" elog() error, in order to
detect when a parent tuple cannot be found for some heap-only tuple --
if this happens, then it indicates corruption. I think that we should
make it a full ereport(), with an errcode of ERRCODE_DATA_CORRUPTION,
to match what Andres just added to code that deals with freezing (he
promoted Assert()s to errors, just like the 2014 commit, though he
went as far as making them ereport()s to begin with). Attached patch
does this.

I propose a backpatch to 9.3, partially for the sake of tools like
amcheck, where users may only be on the lookout for
ERRCODE_DATA_CORRUPTION and ERRCODE_INDEX_CORRUPTED.

This patch applies, passes testing, and appears very straight-forward to
me. Are there any tests for these conditions currently, or are you only
doing that in amcheck?

The enhanced amcheck in the current CF only tests these conditions
indirectly, by pretending to be a CREATE INDEX statement. It matters
to amcheck because these conditions are pretty plausible symptoms of
corruption, and I can imagine someone missing them because they are
not either ERRCODE_DATA_CORRUPTION or ERRCODE_INDEX_CORRUPTED.

If amcheck didn't exist, then I'd still think that this patch was worthwhile.

Yes, I agree. My thrust was more to discover if there is any testing
for these conditions being done in core. It sounds like no, but I don't
think it's the responsibility of this patch to add them.

I'll mark it Ready for Committer.

--
-David
david@pgmasters.net

In reply to: David Steele (#4)
Re: "failed to find parent tuple for heap-only tuple" error as an ERRCODE_DATA_CORRUPTION ereport()

On Thu, Mar 1, 2018 at 11:23 AM, David Steele <david@pgmasters.net> wrote:

Yes, I agree. My thrust was more to discover if there is any testing
for these conditions being done in core. It sounds like no, but I don't
think it's the responsibility of this patch to add them.

The only tests possible with stock Postgres are performed by CREATE
INDEX/REINDEX. That generally isn't practical as a general smoke test,
not least because you cannot do a CREATE INDEX on a standby -- unlike
the amcheck heapallindexed test.

--
Peter Geoghegan

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#5)
Re: "failed to find parent tuple for heap-only tuple" error as an ERRCODE_DATA_CORRUPTION ereport()

Peter Geoghegan <pg@bowt.ie> writes:

On Thu, Mar 1, 2018 at 11:23 AM, David Steele <david@pgmasters.net> wrote:

Yes, I agree. My thrust was more to discover if there is any testing
for these conditions being done in core. It sounds like no, but I don't
think it's the responsibility of this patch to add them.

The only tests possible with stock Postgres are performed by CREATE
INDEX/REINDEX.

Yeah, I see no practical way to test this error case, and little point in
asking this patch to do anything about it. If we had some test rig for
injecting corrupted data, it might be interesting to look at ...

Patch looks fine to me, will push.

regards, tom lane

In reply to: Tom Lane (#6)
Re: "failed to find parent tuple for heap-only tuple" error as an ERRCODE_DATA_CORRUPTION ereport()

On Thu, Mar 1, 2018 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Patch looks fine to me, will push.

Thank you.

--
Peter Geoghegan