Corrupted btree index on HEAD because of covering indexes

Started by Michael Paquierabout 8 years ago31 messageshackers

michael@paquier.xyz

about 8 years ago

Hi all,

I was just testing the VACUUM truncation logic, and bumped into what
looks like a corrupted btree index. Here is a reproducer:
create table aa (a int primary key, b bool);
insert into aa values (generate_series(1,1000000), false);
checkpoint;
update aa set b = false where a > 500000; -- Dirties a set of shared
buffers
delete from aa where a > 750000; -- Delete a set of rows
vacuum aa;
delete from aa where a > 10;
vacuum aa; -- error on btree with right sibling

And here is the actual failure when the second vacuum:
ERROR: XX000: right sibling 4132 of block 2128 is not next child 5396 of block 412 in index "aa_pkey"
LOCATION: _bt_mark_page_halfdead, nbtpage.c:1564

This works on REL_10_STABLE, so I am adding an open item. I have not
investigated the exact problem yet, but bisect is showing me covering
indexes as the culprit (8224de4).

Thanks,
--
Michael

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Michael Paquier (#1)

Re: Corrupted btree index on HEAD because of covering indexes

Will see...

Michael Paquier wrote:

Hi all,

I was just testing the VACUUM truncation logic, and bumped into what
looks like a corrupted btree index. Here is a reproducer:
create table aa (a int primary key, b bool);
insert into aa values (generate_series(1,1000000), false);
checkpoint;
update aa set b = false where a > 500000; -- Dirties a set of shared
buffers
delete from aa where a > 750000; -- Delete a set of rows
vacuum aa;
delete from aa where a > 10;
vacuum aa; -- error on btree with right sibling

And here is the actual failure when the second vacuum:
ERROR: XX000: right sibling 4132 of block 2128 is not next child 5396 of block 412 in index "aa_pkey"
LOCATION: _bt_mark_page_halfdead, nbtpage.c:1564

This works on REL_10_STABLE, so I am adding an open item. I have not
investigated the exact problem yet, but bisect is showing me covering
indexes as the culprit (8224de4).

Thanks,
--
Michael

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Teodor Sigaev (#2)

Re: Corrupted btree index on HEAD because of covering indexes

On Wed, Apr 18, 2018 at 10:29 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

Will see...

I'll take a look tomorrow.

--
Peter Geoghegan

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Peter Geoghegan (#3)

Re: Corrupted btree index on HEAD because of covering indexes

I'll take a look tomorrow.

Interesting, contrib/amcheck doesn't find any error in index. Seems, it's
subject for further improvement.

Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use truncated high
key IndexTuple as a storage of blocknumber of top parent to remove. And sets
BTreeTupleSetNAtts(&trunctuple, 0) - it's stored in ip_posid.

But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key with
ItemPointerIsValid macro - and it returns false, because offset is actually
InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts. And some wrong
decisions are follows, I didn't look at that.

Trivial and naive fix is attached, but for me it looks a bit annoing that we
store pointer (leafhikey) somewhere inside unlocked page.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Teodor Sigaev (#4)

Re: Corrupted btree index on HEAD because of covering indexes

On Thu, Apr 19, 2018 at 9:42 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

Interesting, contrib/amcheck doesn't find any error in index. Seems, it's
subject for further improvement.

I think that you're right that this should be detectable by
bt_index_parent_check(). I have an idea about how we can make this
happen with low overhead:

Iff we're readonly (that is, called through bt_index_parent_check()),
use a bloom filter to fingerprint downlink block numbers on each
non-leaf level, starting from the root. On the next level down, check
that we encountered a downlink at the parent level for non-P_IGNORE()
pages. If we didn't, throw an error. This should only need a small
Bloom filter to work well. I suppose we could only do it when
"heapallindexed = true" was specified to bt_index_parent_check(), and
size the Bloom filter based on the same principle as heapallindexed
verification.

This check is correct because the downlink in the parent is atomically
removed when the page (that the downlink points to) is marked
half-dead by VACUUM (and because there cannot be a concurrent VACUUM,
since we have a ShareLock on the relation). If we find a
non-P_IGNORE() page whose block number was not fingerprinted one level
up, then something must be wrong. (We should also throw an error
within bt_downlink_check() if the page turns out to be P_IGNORE(), to
make our testing "symmetrical".)

The really scary thing about corruption like this is not that it leads
to wrong answers to queries. It's that it *doesn't* lead to wrong
answers to queries (I mentioned this issue to Alexander a few weeks
ago, actually). This is true because we'll move right one level down,
maybe more than once, because it just looks like there was a
concurrent page split to index scans.

(There is actually another subtlety about this new
bt_index_parent_check() check, which is legitimate incomplete page
splits that we detect using P_INCOMPLETE_SPLIT(). The proposed check
still seems totally doable, though.)

I would like to go and implement this check now, and if all goes well
I may propose that you commit it to the master branch for v11. I don't
see this as a new feature. I see it as essential testing
infrastructure. What do you think about adding this new check soon?

Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use truncated
high key IndexTuple as a storage of blocknumber of top parent to remove. And
sets BTreeTupleSetNAtts(&trunctuple, 0) - it's stored in ip_posid.

But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key
with ItemPointerIsValid macro - and it returns false, because offset is
actually InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts.
And some wrong decisions are follows, I didn't look at that.

I'm not at all surprised. I strongly suspected that it was some simple
issue with the representation, since the INCLUDE patch didn't actually
change the page deletion algorithm in any way.

Trivial and naive fix is attached, but for me it looks a bit annoing that we
store pointer (leafhikey) somewhere inside unlocked page.

Well, it has worked that way for a long time. No reason to change it now.

I also think that we could have better conventional regression test
coverage here.

--
Peter Geoghegan

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Peter Geoghegan (#5)

Re: Corrupted btree index on HEAD because of covering indexes

I would like to go and implement this check now, and if all goes well
I may propose that you commit it to the master branch for v11. I don't
see this as a new feature. I see it as essential testing
infrastructure. What do you think about adding this new check soon?

Agree. Hope, nobody found a way how to use amcheck module in production
to serve user requests. But, it should be implemented before BETA stage,
in opposite case we will get a lot of objections.

Nevertheless, seems, I found. In _bt_mark_page_halfdead() we use truncated
high key IndexTuple as a storage of blocknumber of top parent to remove. And
sets BTreeTupleSetNAtts(&trunctuple, 0) - it's stored in ip_posid.

But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key
with ItemPointerIsValid macro - and it returns false, because offset is
actually InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts.
And some wrong decisions are follows, I didn't look at that.

I'm not at all surprised. I strongly suspected that it was some simple
issue with the representation, since the INCLUDE patch didn't actually
change the page deletion algorithm in any way.

Trivial and naive fix is attached, but for me it looks a bit annoing that we
store pointer (leafhikey) somewhere inside unlocked page.

Well, it has worked that way for a long time. No reason to change it now.

Agree, but at least this place needs a comment - why it's safe.

I also think that we could have better conventional regression test
coverage here.

Will try to invent not so large test.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Peter Geoghegan (#5)

Re: Corrupted btree index on HEAD because of covering indexes

I also think that we could have better conventional regression test
coverage here.

I tried to minimize Michael's test case and add it to patch.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

Alexander Korotkov

aekorotkov@gmail.com

about 8 years ago

In reply to: Teodor Sigaev (#7)

Re: Corrupted btree index on HEAD because of covering indexes

On Thu, Apr 19, 2018 at 10:47 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

I also think that we could have better conventional regression test

coverage here.

I tried to minimize Michael's test case and add it to patch.

- if (ItemPointerIsValid(leafhikey))
+ if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)

Should we use BTreeInnerTupleGetDownLink() as soon as we use
BTreeInnerTupleSetDownLink() for setting this?
Or even invent BTreeInnerTupleDownLinkIsValid() macro?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Alexander Korotkov (#8)

Re: Corrupted btree index on HEAD because of covering indexes

On Thu, Apr 19, 2018 at 2:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

- if (ItemPointerIsValid(leafhikey))
+ if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)
Should we use BTreeInnerTupleGetDownLink() as soon as we use
BTreeInnerTupleSetDownLink() for setting this?
Or even invent BTreeInnerTupleDownLinkIsValid() macro?

+1 to at least using BTreeInnerTupleGetDownLink().

I think it's okay to use ItemPointerGetBlockNumberNoCheck() in
amcheck, because it's inconvenient for amcheck to know if it's on the
leaf level or not (it's rather low-context code). However, the nbtree
code doesn't have that problem, and already has very few places that
could possibly need ItemPointerGetBlockNumberNoCheck(). In fact,
Teodor's fix touches the only remaining
ItemPointerGetBlockNumberNoCheck() caller code within all of nbtree.

In summary, I think that we should find a way to never use
ItemPointerGetBlockNumberNoCheck() in nbtree, while still using it in
amcheck.

--
Peter Geoghegan

#10

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Teodor Sigaev (#6)

Re: Corrupted btree index on HEAD because of covering indexes

On Thu, Apr 19, 2018 at 11:59 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

Agree. Hope, nobody found a way how to use amcheck module in production to
serve user requests. But, it should be implemented before BETA stage, in
opposite case we will get a lot of objections.

It shouldn't take that long to get a patch together. I've already started.

I think that any objections to pushing this amcheck enhancement after
feature freeze are unlikely, for these reasons:

* There is a precedent, since the integrity checking functions added
to pg_visibility by commit e472ce96 were added well after feature
freeze, and only 3 months before the release of 9.6.

* Anyone paying attention to the ongoing "VM map freeze corruption"
thread would have to agree that that was an extremely good idea. We're
the ones that have to maintain this code, and I think that we're
entitled to add a little extra C code to a contrib extension, in order
to make testing easier. Including for beta testing.

* The extra overhead is quite low, and will only affect
bt_index_parent_check() (not bt_index_check()).

* I deliberately created bt_index_parent_check() so that we'd have
something that does very thorough testing, that's too slow and/or
requires too heavy a relation-level lock to be useful to most users.
That was the plan from day one.

If you only care about hardware issues, then bt_index_check() is
better in almost every way, and is really all you should be using in
production unless you really don't care about the extra overhead, or
about the ShareLock. The docs more or less say this. I have certainly
heard of people using bt_index_parent_check() in production, but only
when they already knew that their database was corrupt, and wanted to
isolate the problem. I imagine that people that use
bt_index_parent_check() in production do so because they want as much
information as possible, and are willing to do whatever it takes to
get more information.

Agree, but at least this place needs a comment - why it's safe.

Good idea.

I also think that we could have better conventional regression test
coverage here.

Will try to invent not so large test.oif it means they may get a little extra

Your smaller test takes about 350ms to run on a debug build on my
laptop, which seems worth it (honestly, this test should have been
there already). Maybe we can add it to the amcheck regression tests
instead, since those are run less often. This also allows us to add a
test case specifically as part of the amcheck enhancement, which makes
a bit more sense.

--
Peter Geoghegan

#11

Michael Paquier

michael@paquier.xyz

about 8 years ago

In reply to: Teodor Sigaev (#7)

Re: Corrupted btree index on HEAD because of covering indexes

On Thu, Apr 19, 2018 at 10:47:02PM +0300, Teodor Sigaev wrote:

I tried to minimize Michael's test case and add it to patch.

I cannot comment on the actual fix as I lack background in the area, but
having a test case and even more having pg_upgrade do some work on those
pages are good things. This addresses the failure as far as I can see.
--
Michael

#12

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Alexander Korotkov (#8)

Re: Corrupted btree index on HEAD because of covering indexes

I tried to minimize Michael's test case and add it to patch.
-if (ItemPointerIsValid(leafhikey))
+if (ItemPointerGetBlockNumberNoCheck(leafhikey) != InvalidBlockNumber)
Should we use BTreeInnerTupleGetDownLink() as soon as we use
BTreeInnerTupleSetDownLink() for setting this?
Or even invent BTreeInnerTupleDownLinkIsValid() macro?

I am not sure. Here we actually store UP link - to top parent to remove.
I'm afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink
could cause a confusion, in other hand, introducing
TreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems over-engineering

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#13

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Peter Geoghegan (#10)

Re: Corrupted btree index on HEAD because of covering indexes

heard of people using bt_index_parent_check() in production, but only
when they already knew that their database was corrupt, and wanted to
isolate the problem. I imagine that people that use
bt_index_parent_check() in production do so because they want as much
information as possible, and are willing to do whatever it takes to
get more information.

That why I think we need improve amcheck - ideally, it should not miss
any mistakes in tree structure.

Agree, but at least this place needs a comment - why it's safe.

Good idea.

Could you write it? I'm afraid I can't give good explanation why we
believe that nobody update this page ant it's high key while page is
unlocked but pinned.

I also think that we could have better conventional regression test
coverage here.

Will try to invent not so large test.oif it means they may get a little extra

Your smaller test takes about 350ms to run on a debug build on my
laptop, which seems worth it (honestly, this test should have been
there already). Maybe we can add it to the amcheck regression tests
instead, since those are run less often. This also allows us to add a
test case specifically as part of the amcheck enhancement, which makes
a bit more sense.

Hm, it seems to me, that 350ms is short enough to place it in both core
and amcheck test. I think, basic functionality should be covered by core
tests as we test insert/create.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#14

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Teodor Sigaev (#12)

Re: Corrupted btree index on HEAD because of covering indexes

Should we use BTreeInnerTupleGetDownLink() as soon as we use
BTreeInnerTupleSetDownLink() for setting this?
Or even invent BTreeInnerTupleDownLinkIsValid() macro?

I am not sure. Here we actually store UP link - to top parent to remove. I'm
afraid using BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink could cause a
confusion, in other hand, introducing
TreeInnerTupleGetUpLink/BTreeInnerTupleSetUpLink seems over-engineering

After close look I change my opinion. To have a clean code it's much better to
have new pair get/set macroses specialy to manage link to top pare during page
deletion. This removes last naked usage of
ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses self-described
macroses. Patch is attached.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#15

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Teodor Sigaev (#13)

Re: Corrupted btree index on HEAD because of covering indexes

On Thu, Apr 19, 2018 at 11:41 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:

heard of people using bt_index_parent_check() in production, but only
when they already knew that their database was corrupt, and wanted to
isolate the problem. I imagine that people that use
bt_index_parent_check() in production do so because they want as much
information as possible, and are willing to do whatever it takes to
get more information.

That why I think we need improve amcheck - ideally, it should not miss any
mistakes in tree structure.

I have finished writing a prototype that works, and passes all tests.
Without the fix, this is what the complaint from VACUUM looks like on
my machine:

ERROR: right sibling 265 of block 134 is not next child 395 of block
135 in index "delete_test_table_pkey"

After the final VACUUM in your test case runs (the one that throws
this error), my working copy of amcheck can detect the same issue with
low overhead:

pg@~[20995]=# select bt_index_parent_check('delete_test_table_pkey', true);
ERROR: index block lacks downlink in index "delete_test_table_pkey"
DETAIL: Block=265 level=1 page lsn=0/909F4B18.

There is no complaint from amcheck if called before the final VACUUM
in your test case, since that's when the trouble starts (and ends).
It's easy to account for concurrent page splits within amcheck by
checking for BTP_SPLIT_END on the child page that could lack a
downlink. Don't confuse this with checking the BTP_INCOMPLETE_SPLIT
flag, which is what we set on the left half/sibling of a split -- not
the "new" right half/sibling. Of course, the right sibling is the type
of block that can lack a downlink until after the second phase of a
pagesplit (i.e. until it's fully finished), or after the first phase
of page deletion. (As you know, page deletion is like a page split "in
reverse".)

I said concurrent page split, but I really meant incomplete page
split, since a concurrent insert/split is impossible given that
amcheck holds a ShareLock here. (See commit 40dae7ec5.)

Of course, we don't see details about the next level up in the amcheck
error, unlike the VACUUM error. That shouldn't matter, because we're
already verifying a lot about the relationship between blocks at the
next level up, and even between the two levels. This enhancement adds
the one piece we were missing. It could in theory detect other
problems that VACUUM cannot detect, too.

Expect me to post a patch during the work day on Monday or Tuesday
(Pacific time). I need to polish this patch some more. I particularly
need to think about the use of memory for the Bloom filter (right now
I just use work_mem).

Could you write it? I'm afraid I can't give good explanation why we believe
that nobody update this page ant it's high key while page is unlocked but
pinned.

Okay. I'll come up with something soon.

Hm, it seems to me, that 350ms is short enough to place it in both core and
amcheck test. I think, basic functionality should be covered by core tests
as we test insert/create.

I think you're right. There is really no excuse for not having
thorough code coverage for a module like nbtree.

--
Peter Geoghegan

#16

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Teodor Sigaev (#14)

Re: Corrupted btree index on HEAD because of covering indexes

On Fri, Apr 20, 2018 at 7:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

After close look I change my opinion. To have a clean code it's much better
to have new pair get/set macroses specialy to manage link to top pare during
page deletion. This removes last naked usage of
ItemPointer(SetInvalid/IsInvalid/GetBlockNumberNoCheck) and uses
self-described macroses. Patch is attached.

I see your point. Maybe don't have the newline between the get and
set, though, to match the existing style. And, the note about the
assertion seems unnecessary.

I suggest putting something about what general area this deals with.
Perhaps something like the following:

"Get/set leaf page highkey's link. During the second phase of
deletion, the target leaf page's high key may point to an ancestor
page (at all other times, the leaf level high key's link is not used).
See the nbtree README for full details."

--
Peter Geoghegan

#17

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Teodor Sigaev (#4)

Re: Corrupted btree index on HEAD because of covering indexes

On Thu, Apr 19, 2018 at 9:42 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:

But some later, in _bt_unlink_halfdead_page() we check ItemPointer high key
with ItemPointerIsValid macro - and it returns false, because offset is
actually InvalidOffsetNumber - i.e. 0 which was set by BTreeTupleSetNAtts.
And some wrong decisions are follows, I didn't look at that.

The problem here is that multi-level page deletion incorrectly thinks
that it's safe to fully delete a half-dead leaf page. Of course, this
is due to the issue with the representation of high keys that you
identified. IOW, VACUUM incorrect believes that fully deleting the
leaf page will leave no "dangling downlink" one level up, but, in
fact, there is one such dangling downlink. This could eventually lead
to wrong answers to queries, since the deleted page will eventually be
fully recycled.

I refined the amcheck enhancement quite a bit today. It will not just
check that a downlink is not missing; It will also confirm that it
wasn't a legitimately interrupted multi-level deletion, by descending
to the leaf page to match the leaf high key pointer to the top most
parent, which should be the target page (the page that lacked a
downlink according to the new Bloom filter). We need to worry about
multi-level deletions that are interrupted by an error or a hard
crash, which presents a legitimate case where there'll be no downlink
for an internal page in its parent. VACUUM is okay with that, so we
must be too.

I'll finish it off early next week as planned. It should
comprehensively cover issues with multi-level page deletion, including
cases where you have this kind of index corruption and page recycling
later makes it much worse. So, it won't just cover the exact test case
from Michael, which produced corruption that can be detected by
throwing an error when a downlink leads to a fully deleted page
(though only when amcheck has a ShareLock, of course).

Note that the problem here is best understood as a problem around the
presence of a dangling downlink, as oppose to a problem around the
absence of a needed downlink. There is an absent downlink, but that's
beside the point, and in any case missing a downlink is not
*inherently* wrong (as I said, it's not inherently wrong because of
the issue of interrupted multi-level page deletes).

--
Peter Geoghegan

#18

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Peter Geoghegan (#16)

Re: Corrupted btree index on HEAD because of covering indexes

Thank you, pushed

I see your point. Maybe don't have the newline between the get and
set, though, to match the existing style. And, the note about the
assertion seems unnecessary.

Removed newline and note

"Get/set leaf page highkey's link. During the second phase of
deletion, the target leaf page's high key may point to an ancestor
page (at all other times, the leaf level high key's link is not used).
See the nbtree README for full details."

Changed as you suggested.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#19

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Peter Geoghegan (#17)

Re: Corrupted btree index on HEAD because of covering indexes

On Sat, Apr 21, 2018 at 6:02 PM, Peter Geoghegan <pg@bowt.ie> wrote:

I refined the amcheck enhancement quite a bit today. It will not just
check that a downlink is not missing; It will also confirm that it
wasn't a legitimately interrupted multi-level deletion, by descending
to the leaf page to match the leaf high key pointer to the top most
parent, which should be the target page (the page that lacked a
downlink according to the new Bloom filter). We need to worry about
multi-level deletions that are interrupted by an error or a hard
crash, which presents a legitimate case where there'll be no downlink
for an internal page in its parent. VACUUM is okay with that, so we
must be too.

Attached patch lets amcheck detect the issue when
bt_index_parent_check() is called, though only when heapallindexed
verification was requested (that is, only when bt_index_parent_check()
is doing something with a Bloom filter already). The new checks will
probably also detect any possible issue with multi-level page
deletions. The patch tightens up our general expectations around
half-dead and fully deleted pages, which seems necessary but also
independently useful.

I'm using work_mem to constrain the size of the second Bloom filter,
whereas the heapallindexed Bloom filter is constrained by
maintenance_work_mem. This seems fine to me, since we have always used
an additional work_mem budget for spool2 when building a unique index
within nbtsort.c. Besides, it will probably be very common for the
downlink Bloom filter to be less than 1% the size of the first Bloom
filter when we have adequate memory for both Bloom filters (i.e. very
small). I thought about mentioning this work_mem allocation in the
docs, but decided that there was no need, since the CREATE INDEX
spool2 work_mem stuff isn't documented anywhere either.

Note that the "c.relkind = 'i'" change in the docs is about not
breaking the amcheck query when local partitioned indexes happen to be
in use (when the user changed the sample SQL query to not just look at
pg_catalog indexes). See the "Local partitioned indexes and
pageinspect" thread I just started for full details.

The new P_ISDELETED() test within bt_downlink_missing_check() is what
actually detects the corruption that the test case causes, since the
fully deleted leaf page still has a sane top parent block number left
behind (that is, we don't get as far as testing "if
(BTreeTupleGetTopParent(itup) == state->targetblock)"; that's not how
the the leaf page can get corrupt in the test case that Michael
posted). Note that there are also two new similar P_ISDELETED() tests
added to two existing functions (bt_downlink_check() and
bt_check_level_from_leftmost()), but those tests won't detect the
corruption that we saw. They're really there to nail down how we think
about fully deleted pages.

--
Peter Geoghegan

#20

Teodor Sigaev

teodor@sigaev.ru

about 8 years ago

In reply to: Peter Geoghegan (#19)

Re: Corrupted btree index on HEAD because of covering indexes

Perfect! I would like to commit it but have some suggestions:

1)
TRUNCATE bttest_multi;
INSERT INTO bttest_multi SELECT i, i%2 FROM generate_series(1, 100000) as i;
SELECT bt_index_parent_check('bttest_multi_idx', true);

to improve test stability it would be better to disable autovacuum:
ALTER TABLE bttest_multi SET (autovacuum_enabled = false)

2) Pls, add to test DELETE as it done in 6db4b49986be3fe59a1f6ba6fabf9852864efc3e

3) It's not directly connected to this patch, but allocation of BtreeCheckState
is not needed, it could be allocated on stack.

4) Nevertheless, I suggest to use palloc0 (or memset(0)) for BtreeCheckState.
Now several fields of that structure could be not inited.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#21

Peter Geoghegan

pg@bowt.ie

about 8 years ago

In reply to: Teodor Sigaev (#20)

#22

Teodor Sigaev