GiST insert algorithm rewrite

Started by Heikki Linnakangasover 15 years ago42 messageshackers
Jump to latest
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Following the discussions here
(http://archives.postgresql.org/message-id/12375.1289429390@sss.pgh.pa.us),
here's a draft version of a rewrite of the GiST insertion algorithm and
WAL-logging. The goal is to get rid of the tracking of incomplete splits
and inserts in WAL replay, and the fixup activities at the end of recovery.

There are two key changes to the insert algorithm:

1. When we walk down the tree searching for a suitable leaf page to
insert the new tuple to, we update the nodes on the way down so that
they are consistent with the new key we're inserting. The old GiST
algorithm adjusted all the parent pages only after inserting the tuple
on the leaf. Updating them on the way down ensures that the tree is
self-consistent at all times, even if we crash just after inserting the
new tuple on the leaf page.

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again
the purpose is to ensure that the tree is self-consistent at all times.
If we crash just after a page split, before the downlink is inserted,
scans will find the tuples on the right page by following the rightlink.
It's slightly less performant, but correct.

The WAL replay code has been simplified accordingly, we no longer need
any fixup code to finish incomplete splits or inserts, and we no longer
need the "invalid" pointers in the tree.

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.

I haven't done much testing yet, although it passes the regression
tests. I'll do more testing, and crash testing, and I'll have to
re-review the patch myself before committing.

Comments?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist-insert-rewrite-1.patchtext/x-diff; name=gist-insert-rewrite-1.patchDownload+1328-1747
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: GiST insert algorithm rewrite

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

There are two key changes to the insert algorithm:

1. When we walk down the tree searching for a suitable leaf page to
insert the new tuple to, we update the nodes on the way down so that
they are consistent with the new key we're inserting. The old GiST
algorithm adjusted all the parent pages only after inserting the tuple
on the leaf. Updating them on the way down ensures that the tree is
self-consistent at all times, even if we crash just after inserting the
new tuple on the leaf page.

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again
the purpose is to ensure that the tree is self-consistent at all times.
If we crash just after a page split, before the downlink is inserted,
scans will find the tuples on the right page by following the rightlink.
It's slightly less performant, but correct.

The one thought that comes to mind is how does the flag business work
after multiple splittings? That is, assume we have a page that has the
flag set because of a previous crash. If we have to split either that
page or its right sibling, what do we do with the flags? I think that
it can be made to work, so long as searches keep moving right as long
as the flag is set. But this needs to be thought through, and
documented in the README file. I'm particularly worried whether the
after-the-fact fixup that you mention in README might fail in such
scenarios.

regards, tom lane

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#2)
Re: GiST insert algorithm rewrite

On 16.11.2010 20:01, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again
the purpose is to ensure that the tree is self-consistent at all times.
If we crash just after a page split, before the downlink is inserted,
scans will find the tuples on the right page by following the rightlink.
It's slightly less performant, but correct.

The one thought that comes to mind is how does the flag business work
after multiple splittings? That is, assume we have a page that has the
flag set because of a previous crash. If we have to split either that
page or its right sibling, what do we do with the flags?

As I mentioned in the README, the insertion algorithm finishes any
incomplete splits it sees before proceeding. AFAICS that should ensure
that the situation never arises where you try to split a page that
already has the flag set. Or its right sibling; such a page can only be
reached via the rightlink so you would see the page with the flag set first.

Hmm, there is one corner-case that I didin't think of before: One
backend splits a leaf page, and another backend concurrently splits the
parent of that leaf page. If for some reason the backend operating on
the parent page dies, releasing the locks, the other backend will see
the incomplete split when it walks up to insert the downlink. Although
it should be possible to handle that, I think we can simply give up on
inserting the downlink in that case, and leave that split incomplete as
well. It's still correct, and next insert that comes along will complete
the splits, from top to bottom.

BTW, I don't try to fix incomplete splits during vacuum in the patch.
That's perhaps a bit surprising, and probably would be easy to add, but
I left it out for now as it's not strictly necessary.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: GiST insert algorithm rewrite

On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

BTW, I don't try to fix incomplete splits during vacuum in the patch. That's
perhaps a bit surprising, and probably would be easy to add, but I left it
out for now as it's not strictly necessary.

Seems like it would be good to have this; otherwise, the split might
stay incompletely indefinitely? Would that be bad?

If we start to enlarge the bounding boxes on the higher levels of the
tree and then crash before inserting the key, is there any mechanism
for getting them back down to the minimal size?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#4)
Re: GiST insert algorithm rewrite

On 16.11.2010 20:46, Robert Haas wrote:

On Tue, Nov 16, 2010 at 1:22 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

BTW, I don't try to fix incomplete splits during vacuum in the patch. That's
perhaps a bit surprising, and probably would be easy to add, but I left it
out for now as it's not strictly necessary.

Seems like it would be good to have this; otherwise, the split might
stay incompletely indefinitely? Would that be bad?

Nothing bad should happen. Scans that need to traverse the incompletely
split page would just be marginally slower.

If we start to enlarge the bounding boxes on the higher levels of the
tree and then crash before inserting the key, is there any mechanism
for getting them back down to the minimal size?

No. There's also no mechanism for trimming the bounding boxes if a tuple
is deleted.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: GiST insert algorithm rewrite

On Tue, Nov 16, 2010 at 1:50 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

If we start to enlarge the bounding boxes on the higher levels of the
tree and then crash before inserting the key, is there any mechanism
for getting them back down to the minimal size?

No. There's also no mechanism for trimming the bounding boxes if a tuple is
deleted.

Oh. So do the indexes just degrade over time until they eventually
need to be REINDEX'd?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Teodor Sigaev
teodor@sigaev.ru
In reply to: Heikki Linnakangas (#1)
Re: GiST insert algorithm rewrite

they are consistent with the new key we're inserting. The old GiST
algorithm adjusted all the parent pages only after inserting the tuple
on the leaf. Updating them on the way down ensures that the tree is

Hm, performance? while you traverse to leaf page, on each inner page you will
need to call unionFn/equalFn methods to decide update or not key on inner page.
Now we stops do that after first positive result of equalFn while walking up.
Next, when child page splits then you need to update parent twice - first time
while walk down, and second while walk up.

As I see, you try to implement algorithm very close to original, but it was
rather slow.
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3

self-consistent at all times, even if we crash just after inserting the
new tuple on the leaf page.

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again

Again, twice write of new children (it could be several because of
implementation of user-defined picksplit method).

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: GiST insert algorithm rewrite

Robert Haas <robertmhaas@gmail.com> writes:

Oh. So do the indexes just degrade over time until they eventually
need to be REINDEX'd?

At some point you might reach a state where a reindex would be helpful.
But the same is true of btrees. I don't think this is a serious
objection, at least not unless backed by evidence that the tree often
degrades rapidly. Anyway fixing it would be material for a different
patch.

regards, tom lane

#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Teodor Sigaev (#7)
Re: GiST insert algorithm rewrite

On 16.11.2010 21:20, Teodor Sigaev wrote:

they are consistent with the new key we're inserting. The old GiST
algorithm adjusted all the parent pages only after inserting the tuple
on the leaf. Updating them on the way down ensures that the tree is

Hm, performance? while you traverse to leaf page, on each inner page you
will need to call unionFn/equalFn methods to decide update or not key on
inner page. Now we stops do that after first positive result of equalFn
while walking up. Next, when child page splits then you need to update
parent twice - first time while walk down, and second while walk up.

Hmm, will have to do some benchmarking on that. I'm using the Consistent
function when walking down to check if the downlink needs to be updated,
and assumed that it would be insignificant compared to the cost of
calling Penalty on all the keys on the page.

As I see, you try to implement algorithm very close to original, but it
was rather slow.
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=0ad7db4be4b1f0208271c49fc1c8348f11ebc5b3

Thanks for that, I'll have to read that through as well.

self-consistent at all times, even if we crash just after inserting the
new tuple on the leaf page.

2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again

Again, twice write of new children (it could be several because of
implementation of user-defined picksplit method).

There should be no difference in performance here AFAICS. The children
need to be updated a second time to clear the flag, but we don't release
the locks on them in the middle, and we're only talking about setting a
single flag, so it should make no difference.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: GiST insert algorithm rewrite

On Tue, Nov 16, 2010 at 3:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Oh.  So do the indexes just degrade over time until they eventually
need to be REINDEX'd?

At some point you might reach a state where a reindex would be helpful.
But the same is true of btrees.  I don't think this is a serious
objection, at least not unless backed by evidence that the tree often
degrades rapidly.  Anyway fixing it would be material for a different
patch.

Oh, I agree it's not for this patch to fix it, if it's already that
way. I was just curious. I think index maintenance is going to be a
problem we have to devote some cycles to down the road, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Teodor Sigaev
teodor@sigaev.ru
In reply to: Heikki Linnakangas (#9)
Re: GiST insert algorithm rewrite

Hmm, will have to do some benchmarking on that. I'm using the Consistent
function when walking down to check if the downlink needs to be updated,
and assumed that it would be insignificant compared to the cost of
calling Penalty on all the keys on the page.

Why consistent?! It's impossible - you don't know right strategy number, index
with storage type/over type could do not accept the same type as query. Index
over tsvector is an example.

There should be no difference in performance here AFAICS. The children
need to be updated a second time to clear the flag, but we don't release
the locks on them in the middle, and we're only talking about setting a
single flag, so it should make no difference.

Agree with that
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#12Teodor Sigaev
teodor@sigaev.ru
In reply to: Heikki Linnakangas (#9)
Re: GiST insert algorithm rewrite

Sorry, I missed beginning of discussion on GiST, so I read it on the web mail
archive.

You wrote:
http://archives.postgresql.org/pgsql-hackers/2010-11/msg00939.php
[skip]
0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and new
buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original downlink is
updated to reflect only the keys that stayed on the left page). While keeping
the child pages locked, the NSN field on the children are updated with the new
LSN of the parent page.
...
The scan checks that by comparing the LSN it saw on the parent page with the NSN
on the child page. If parent LSN < NSN, we saw the parent before the downlink
was inserted.

Now, the problem with crash recovery is that the above algorithm depends on the
split to keep the parent and child locked until the downlink is inserted in the
parent. If you crash between steps 2 and 3, the locks are gone. If a later
insert then updates the parent page, because of a split on some unrelated child
page, that will bump the LSN of the parent above the NSN on the child. Scans
will see that the parent LSN > child NSN, and will no longer follow the > rightlink.
[skip]

I disagree with that opinion: if we crash between 2 and 3 then why will somebody
update parent before WAL replay? WAL replay process in this case should complete
child split by inserting "invalid" pointer and tree become correct again,
although it needs to repair "invalid" pointers. The same situation with b-tree:
WAL replay repairs incomplete split before any other processing.

Or do I miss something important?

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Teodor Sigaev (#12)
Re: GiST insert algorithm rewrite

On 17.11.2010 19:46, Teodor Sigaev wrote:

I disagree with that opinion: if we crash between 2 and 3 then why will
somebody update parent before WAL replay? WAL replay process in this
case should complete child split by inserting "invalid" pointer and tree
become correct again, although it needs to repair "invalid" pointers.
The same situation with b-tree: WAL replay repairs incomplete split
before any other processing.

Or do I miss something important?

Yeah, see the thread that started this:

http://archives.postgresql.org/pgsql-hackers/2010-11/msg00052.php
http://archives.postgresql.org/message-id/12375.1289429390@sss.pgh.pa.us

The code currently relies on the end-of-recovery processing to finish
the incomplete, but I'm trying to get rid of that end-of-recovery
processing altogether.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Teodor Sigaev (#11)
Re: GiST insert algorithm rewrite

On 17.11.2010 19:36, Teodor Sigaev wrote:

Hmm, will have to do some benchmarking on that. I'm using the Consistent
function when walking down to check if the downlink needs to be updated,
and assumed that it would be insignificant compared to the cost of
calling Penalty on all the keys on the page.

Why consistent?! It's impossible - you don't know right strategy number,
index with storage type/over type could do not accept the same type as
query. Index over tsvector is an example.

Sorry, I was confused. I'm calling the gistgetadjusted() function, which
uses the Union function. Ie. I'm doing the same we did before when
propagating the changes up the tree. I'm just doing it on the way down
instead.

I ran some quick performance tests on my laptop, and couldn't see any
measurable difference with the patch. So I think we're good on
performance. I used the attached scripts, with \timing.

Have you had a chance to look at the patch yet? I'm hesitant to commit
before you take a look at it, though I still have to proofread it myself
as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gisttest2.sqltext/x-sql; name=gisttest2.sqlDownload
gisttest.sqltext/x-sql; name=gisttest.sqlDownload
#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#14)
Re: GiST insert algorithm rewrite

On 18.11.2010 14:58, Heikki Linnakangas wrote:

On 17.11.2010 19:36, Teodor Sigaev wrote:

Hmm, will have to do some benchmarking on that. I'm using the Consistent
function when walking down to check if the downlink needs to be updated,
and assumed that it would be insignificant compared to the cost of
calling Penalty on all the keys on the page.

Why consistent?! It's impossible - you don't know right strategy number,
index with storage type/over type could do not accept the same type as
query. Index over tsvector is an example.

Sorry, I was confused. I'm calling the gistgetadjusted() function, which
uses the Union function. Ie. I'm doing the same we did before when
propagating the changes up the tree. I'm just doing it on the way down
instead.

I ran some quick performance tests on my laptop, and couldn't see any
measurable difference with the patch. So I think we're good on
performance. I used the attached scripts, with \timing.

Have you had a chance to look at the patch yet? I'm hesitant to commit
before you take a look at it, though I still have to proofread it myself
as well.

Here's an updated version with some minor fixes. I'd appreciate review,
as well as pointers to good test cases for this.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

gist-insert-rewrite-2.patchtext/x-diff; name=gist-insert-rewrite-2.patchDownload+1419-1900
#16Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#1)
Re: GiST insert algorithm rewrite

Heikki Linnakangas wrote:

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.

OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?

You are saying it would have to be run before the upgrade. Can it not
be run after?

I can output a script to VACUUM all such indexes, and tell users to
manually REINDEX any index that generates a warning messasge. I don't
have any way to automate an optional REINDEX step.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#16)
Re: GiST insert algorithm rewrite

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.

OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?

9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.

You are saying it would have to be run before the upgrade. Can it not
be run after?

After would work too.

I can output a script to VACUUM all such indexes, and tell users to
manually REINDEX any index that generates a warning messasge. I don't
have any way to automate an optional REINDEX step.

That seems good enough.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#17)
Re: GiST insert algorithm rewrite

On 30.11.2010 11:55, Heikki Linnakangas wrote:

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.

OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?

9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.

Oh sorry, I read your question as "9.0 *or* 9.1".

Only GiST indexes that have any "invalid" tuples in them n, as a result
of a crash, need to be reindexed. That's very rare in practice, so we
shouldn't invalidate all GiST indexes. I don't think there's any simple
way to check whether reindex is required, so I think we have to just
document this.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: GiST insert algorithm rewrite

On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 30.11.2010 11:55, Heikki Linnakangas wrote:

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.

OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?

9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.

Oh sorry, I read your question as "9.0 *or* 9.1".

Only GiST indexes that have any "invalid" tuples in them n, as a result of a
crash, need to be reindexed. That's very rare in practice, so we shouldn't
invalidate all GiST indexes. I don't think there's any simple way to check
whether reindex is required, so I think we have to just document this.

It seems odd to say, the indexes are corrupted, but they're probably
not, so let's not worry about it.

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#19)
Re: GiST insert algorithm rewrite

On 30.11.2010 16:23, Robert Haas wrote:

On Tue, Nov 30, 2010 at 5:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 30.11.2010 11:55, Heikki Linnakangas wrote:

On 27.11.2010 21:31, Bruce Momjian wrote:

Heikki Linnakangas wrote:

There's no on-disk format changes, except for the additional flag in the
page headers, so this does not affect pg_upgrade. However, if there's
any "invalid" keys in the old index because of an incomplete insertion,
the new code will not understand that. So you should run vacuum to
ensure that there's no such invalid keys in the index before upgrading.
Vacuum will print a message in the log if it finds any, and you will
have to reindex. But that's what it suggests you to do anyway.

OK, pg_upgrade has code to report invalid gin and hash indexes because
of changes between PG 8.3 and 8.4. Is this something we would do for
9.0 to 9.1?

9.1. The problem that started this whole thing is there in older
versions, but given the lack of real-life reports and the scale of the
changes required it doesn't seem wise to backport.

Oh sorry, I read your question as "9.0 *or* 9.1".

Only GiST indexes that have any "invalid" tuples in them n, as a result of a
crash, need to be reindexed. That's very rare in practice, so we shouldn't
invalidate all GiST indexes. I don't think there's any simple way to check
whether reindex is required, so I think we have to just document this.

It seems odd to say, the indexes are corrupted, but they're probably
not, so let's not worry about it.

I assume there's no way to make the new code cope with any
pre-existing corruption?

Does the current code cope with the corruption?

It's not corruption, but "intended degradation". Yes, the current code
copes with it, that's how GiST survives a crash. However, even with the
current code, VACUUM will nag if it finds any invalid tuples with this
message:

ereport(NOTICE,
(errmsg("index \"%s\" needs VACUUM FULL or REINDEX to finish crash
recovery",

That's harmless, in the sense that all scans and inserts work fine, but
scans might need to do more work than if the invalid tuple wasn't there.

I don't think we need to go out of our way to support such degraded
indexes in 9.1. If you see such notices in your logs, you should REINDEX
anyway, before of after pg_upgrade. Let's just make sure that you get a
reasonable error message in 9.1 if a scan or insert encounters such a tuple.

There is a section on this in the docs, BTW:
http://www.postgresql.org/docs/9.0/static/gist-recovery.html

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#20)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#23)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#23)
#26Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#25)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#25)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#27)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#28)
#30Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#29)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#27)
#32Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#31)
#33Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#32)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#33)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#34)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#35)
#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#36)
#38Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#36)
#39Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#38)
#40Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#39)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#40)
#42Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#41)