B-tree parent pointer and checkpoints

Started by Heikki Linnakangas over 15 years ago · 34 messages · pgsql-hackers
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

We have the rm_safe_restartpoint mechanism to ensure that we don't use a
checkpoint that splits a multi-level B-tree insertion as a restart
point. But to my surprise, we don't have anything to protect against the
analogous case during normal operation. This is possible:

1. Split child page. Write WAL records for the child pages.
2. Begin and finish a checkpoint
3. Crash, before writing the WAL record of inserting the child pointer
in the parent B-tree page.
4. Recovery begins at the new checkpoint, never sees the incomplete
split, so it stays incomplete.

In practice that's pretty hard to hit, because a checkpoint takes some
time, while locking the parent page and writing the child pointer is
usually very quick. But it's possible.
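The race above can be sketched as a toy model (not PostgreSQL source; record
names and the list-index stand-in for an LSN are purely illustrative):

```python
# Models why a checkpoint that lands between the child-split WAL record and
# the parent-update WAL record leaves the split incomplete after recovery.

wal = []  # the write-ahead log, in order

def append_wal(record):
    wal.append(record)
    return len(wal) - 1  # list position, a stand-in for an LSN

# 1. Split the child page; its WAL record is written immediately.
append_wal("btree_split_child")

# 2. A checkpoint begins and finishes; its REDO pointer is *after* the split.
redo_ptr = len(wal)
append_wal("checkpoint")

# 3. Crash before the parent-pointer insertion is logged:
# append_wal("btree_insert_parent")  # never written

# 4. Recovery replays only records at or after the REDO pointer,
# so it never sees the split record and cannot finish the split.
replayed = wal[redo_ptr:]
assert "btree_split_child" not in replayed  # the split stays incomplete
```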

It surprises me that we thought of this when we introduced
restartpoints, but this more obvious case during normal operation seems
to have been there forever. Nothing very bad happens if you lose the
parent update, but this would be nice to fix nevertheless.

I bumped into this while thinking about archive recovery - the above can
happen at archive recovery too if the checkpoint is caused by
pg_start_backup().

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written. That will close the
issue with restartpoints, archive recovery etc. as well, so we no longer
need to worry about this anywhere else than while performing an online
checkpoint.

I'm thinking of using the isCommit flag for this, to delay writing the
checkpoint record until all incomplete splits are finished. isCommit
protects against a similar race condition between writing the commit
record and flushing the clog page. It will obviously need to be renamed,
and we'll need to double-check that this use is safe: b-tree splits take
longer, and there's no critical section there like there is in the
commit codepath.

Comments?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2 ghatpande@vsnl.net
ghatpande@vsnl.net
In reply to: Heikki Linnakangas (#1)
Intelligent RDBMS

Hello All,
I am Vijay Ghatpande from Pune, India. I have been in the IT field for the last 30 years, mainly in ERP design, development and implementation. I have worked on JD Edwards, Oracle Apps and SAP. I was involved in the design and development of an ERP package called ISP – Integrated Software for Production. Now I am independent, having started my own consultancy, AMS ERP Solutions, and I work for SME industries in and around Pune. I am technical and have worked with everything from COBOL and DGL to .NET and C++. I have also worked as an ORACLE and DB2 DBA for many years.

I have a dream project in my mind. It is on relational database management. Please read the paragraph below from the angle of users of applications developed on an RDBMS.
Any application contains many tables, and they are related to each other. The RDBMS keeps the relations. Developers and users have to know these relations to extract the proper data and convert it into information. If multiple tables are involved, the query can be more complex. The main problem with an RDBMS is writing SQL. End users avoid SQL and think that it is technical work. This is a limitation of RDBMS. If SQL becomes free of tables and relations, then I am sure everyone will like it.

For example, MS Dynamics has many tables to store its data. We can keep this data in such a way that for the developer MS Dynamics will be one table, say MSD, and the user can extract any data from that table. He need not know the internal table relations. If this happens, life will be easy for many. Application development will be very easy and cost effective. This can be achieved by creating views. I am thinking of an intelligent database where the relations are built in. If this is available in the product itself, then the domain knowledge will be in the RDBMS. My intention is not to hide relations from users, but this will be one of the advantages of an intelligent RDBMS. This will be the next generation of RDBMS. This is going to change the RDBMS and ERP world.

I am requesting you to give me feedback.

With Warm Regards,

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

regards, tom lane

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: B-tree parent pointer and checkpoints

On 02.11.2010 16:30, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

You mean after CreateCheckPoint has determined the redo pointer, but
before it has written the checkpoint record? The new split can go ahead,
and the checkpoint doesn't need to care about it. Recovery will start at
the redo pointer, so it will see the split record, and will know to
finish the incomplete split if necessary.

The logic is the same as with inCommit. Checkpoint will fetch the list
of in-progress splits some time after determining the redo-pointer. It
will then wait until all of those splits have finished. Any new splits
that begin after fetching the list don't affect the checkpoint.

inCommit can't be used as is, because it's tied to the Xid, but
something similar should work.
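The snapshot-then-wait logic described above can be sketched as follows (the
function and set names are invented for illustration, not from the patch):

```python
# The checkpointer snapshots the set of multi-WAL-record actions that are in
# progress when the REDO pointer is chosen, and waits only for those; actions
# begun after the snapshot don't block the checkpoint.

in_progress = {"split-A", "split-B"}   # running when the REDO pointer was taken

def checkpoint_may_proceed(still_running):
    # Wait only for members of the snapshot; brand-new splits are ignored,
    # since recovery will replay them from the REDO pointer onwards anyway.
    snapshot = set(in_progress)
    return not (snapshot & still_running)

assert not checkpoint_may_proceed({"split-A", "split-C"})  # old split still running
assert checkpoint_may_proceed({"split-C", "split-D"})      # only new splits remain
```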

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#4)
Re: B-tree parent pointer and checkpoints

On 02.11.2010 16:40, Heikki Linnakangas wrote:

On 02.11.2010 16:30, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

You mean after CreateCheckPoint has determined the redo pointer, but
before it has written the checkpoint record? The new split can go ahead,
and the checkpoint doesn't need to care about it. Recovery will start at
the redo pointer, so it will see the split record, and will know to
finish the incomplete split if necessary.

The logic is the same as with inCommit. Checkpoint will fetch the list
of in-progress splits some time after determining the redo-pointer. It
will then wait until all of those splits have finished. Any new splits
that begin after fetching the list don't affect the checkpoint.

inCommit can't be used as is, because it's tied to the Xid, but
something similar should work.

Here's a first draft of this, using the inCommit flag as is. It works,
but suffers from starvation if you have a lot of concurrent
multi-WAL-record actions. I tested that by running INSERTs to a table
with tsvector field with a GiST index on it from five concurrent
sessions, and saw checkpoints regularly busy-waiting for over a minute.

To avoid that, we need something a little bit more complicated than a
boolean flag. I'm thinking of adding a counter beside the inCommit flag
that's incremented every time a new multi-WAL-record action begins, so
that the checkpoint process can distinguish between a new action that
was started after deciding the REDO pointer and an old one that's still
running.

(inCommit is a misnomer now, of course. Will need to find a better name..)
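The counter idea can be sketched like this (class and function names are
hypothetical; the real state would live in shared memory, e.g. PGPROC):

```python
# A per-backend counter, bumped at the start of every multi-WAL-record action,
# lets the checkpointer tell an *old* action (same counter value as when the
# snapshot was taken) from a *new* one, avoiding starvation under a constant
# stream of new actions.

class Backend:
    def __init__(self):
        self.in_action = False
        self.action_counter = 0   # incremented when a new action begins

    def begin_action(self):
        self.action_counter += 1
        self.in_action = True

    def end_action(self):
        self.in_action = False

def snapshot(backends):
    # Remember which backends were busy, and at what counter value.
    return {id(b): b.action_counter for b in backends if b.in_action}

def must_wait(backends, snap):
    # Wait only for a backend still inside the *same* action we snapshotted.
    return any(b.in_action and snap.get(id(b)) == b.action_counter
               for b in backends)

b = Backend()
b.begin_action()
snap = snapshot([b])
assert must_wait([b], snap)        # the old action is still running: wait
b.end_action()
b.begin_action()                   # a brand-new action begins
assert not must_wait([b], snap)    # the counter moved on: no need to wait
```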

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

split-delay-checkpoint-1.patch (text/x-diff, +95/-71)
#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#5)
Re: B-tree parent pointer and checkpoints

On 08.11.2010 15:40, Heikki Linnakangas wrote:

Here's a first draft of this, using the inCommit flag as is. It works,
but suffers from starvation if you have a lot of concurrent
multi-WAL-record actions. I tested that by running INSERTs to a table
with tsvector field with a GiST index on it from five concurrent
sessions, and saw checkpoints regularly busy-waiting for over a minute.

To avoid that, we need something a little bit more complicated than a
boolean flag. I'm thinking of adding a counter beside the inCommit flag
that's incremented every time a new multi-WAL-record action begins, so
that the checkpoint process can distinguish between a new action that
was started after deciding the REDO pointer and an old one that's still
running.

(inCommit is a misnomer now, of course. Will need to find a better name..)

Here's a 2nd version, with an additional counter in PGPROC to avoid
starving the checkpoint in the face of a constant stream of, e.g., GiST
inserts.

The new rule is that before you start a multi-WAL-record operation that
needs to be completed at end of recovery if you crash in the middle, you
call HoldCheckpoint(), and once you're finished, ResumeCheckpoint().
rm_safe_restartpoint() is gone.
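A caller would bracket the whole operation like this. HoldCheckpoint and
ResumeCheckpoint are the names from the patch; the counter-based model below
is only an illustrative stand-in for the real shared-memory implementation:

```python
import threading

# A hold counter: checkpoints must wait while any multi-WAL-record
# operation holds it. (Purely a sketch of the calling convention.)
hold_count = 0
lock = threading.Lock()

def HoldCheckpoint():
    global hold_count
    with lock:
        hold_count += 1

def ResumeCheckpoint():
    global hold_count
    with lock:
        hold_count -= 1

def checkpoint_must_wait():
    with lock:
        return hold_count > 0

HoldCheckpoint()                 # before the first WAL record of the split
assert checkpoint_must_wait()
# ... write child-split record, then parent-update record ...
ResumeCheckpoint()               # after the last record
assert not checkpoint_must_wait()
```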

This is a pre-existing bug, but given the lack of field reports and the
fact that it's pretty darn hard to run into this in real life, I'm
inclined to not backpatch this.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

split-delay-checkpoint-2.patch (text/x-diff, +350/-292)
#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#6)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

The new rule is that before you start a multi-WAL-record operation that
needs to be completed at end of recovery if you crash in the middle, you
call HoldCheckpoint(), and once you're finished, ResumeCheckpoint().

What happens if you error out in between? Or is it assumed that the
*entire* sequence is a critical section? If it has to be that way,
one might wonder what's the point of trying to split it into multiple
WAL records.

regards, tom lane

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: B-tree parent pointer and checkpoints

I wrote:

What happens if you error out in between? Or is it assumed that the
*entire* sequence is a critical section? If it has to be that way,
one might wonder what's the point of trying to split it into multiple
WAL records.

Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
a bad idea that we should just rip out.

The question that ought to be asked here, I think, is whether it
shouldn't be required that every inter-WAL-record state is a valid
consistent state that doesn't require post-crash fixups. If that
isn't the case, then a simple ERROR or FATAL exit out of the backend
that was creating the sequence originally will leave the system in
an unacceptable state. We could prevent such an exit by wrapping the
whole sequence in a critical section, but if you have to do that then
it's not apparent why you shouldn't fold it into one WAL record.

IOW, forget this patch. Take out the logic that tries to complete
pending splits during replay, instead. I believe this is perfectly safe
for btree: loss of a parent record isn't fatal, as proven by the fact
that searches don't have to be locked out while a split proceeds.
(We might want to make btree_page_del not think that a missing parent
record is an error, but it shouldn't think that anyway, because of the
possibility of a non-crashing failure during the original split.)
This approach might not be safe for GIST or GIN; but if it isn't, they
need fixes anyway.

regards, tom lane

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#8)
Re: B-tree parent pointer and checkpoints

On 11.11.2010 00:49, Tom Lane wrote:

I wrote:

What happens if you error out in between? Or is it assumed that the
*entire* sequence is a critical section? If it has to be that way,
one might wonder what's the point of trying to split it into multiple
WAL records.

Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
a bad idea that we should just rip out.

The question that ought to be asked here, I think, is whether it
shouldn't be required that every inter-WAL-record state is a valid
consistent state that doesn't require post-crash fixups. If that
isn't the case, then a simple ERROR or FATAL exit out of the backend
that was creating the sequence originally will leave the system in
an unacceptable state. We could prevent such an exit by wrapping the
whole sequence in a critical section, but if you have to do that then
it's not apparent why you shouldn't fold it into one WAL record.

IOW, forget this patch. Take out the logic that tries to complete
pending splits during replay, instead. I believe this is perfectly safe
for btree: loss of a parent record isn't fatal, as proven by the fact
that searches don't have to be locked out while a split proceeds.
(We might want to make btree_page_del not think that a missing parent
record is an error, but it shouldn't think that anyway, because of the
possibility of a non-crashing failure during the original split.)
This approach might not be safe for GIST or GIN; but if it isn't, they
need fixes anyway.

GIN is similar to b-tree: the incomplete-split logic there is for
inserting the parent pointers in the b-tree within the GIN index, just
like in nbtree.

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to GiST. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):

The problem with incomplete inserts is: when a new entry is installed into a leaf page, in the worst case the whole chain of keys from the root page to the leaf should be updated. Of course, updating the whole chain is rare; usually only part of it is updated. Each key on an inner page contains the union of the keys (a bounding box in the case of rtree, for example) on the page it points to. This union can be formed only with the help of a user-defined function of the opclass, because GiST doesn't know anything about the nature of the keys. Returning to WAL: the GiST core writes an xlog entry with all the necessary information for restoration before writing the page, but at that moment it doesn't know whether it will need to update keys on the parent page or whether the key is unchanged. So GiST's WAL restoration code should remember this page's update as an incomplete insert. When the insert is completed, the GiST core writes to the log that the insert is completed, and the restore code can clean up the stored incomplete insert. If there was a crash, then the sign of the completed insert can be absent from the log, and GiST's restore code should continue it. While continuing, it knows which page was changed and should walk up to the root. On each step of the walk it should form the union for the page and insert it into the parent.

Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

Do some of the other GiST algorithms get confused if there's a key on a
page that's not represented by the parent pointer? It's possible that
you insert a key to the leaf, update the leaf's immediate parent, but
crash before updating the parent's parent. As far as search is
concerned, that's OK as well, but it creates a hazard for subsequent
inserts. If you later insert a tuple with the same key to the same leaf
page, the insertion will see that the parent pointer already includes
the key, and will fail to update the parent's parent. That's a problem.

Would it be hard to change the algorithm to update the parent keys
top-down rather than bottom-up?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#9)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

I think it'd be okay as far as that one entry is concerned, since as you
say it doesn't matter whether a search finds it. (We'd have to be sure
that VACUUM would still find it to remove it, of course, but that
doesn't use a normal search.) You're right that it poses a hazard of
subsequent inserts deciding that they don't need to do work on upper
levels because the lower ones look OK already. But depending on the
details of the search algorithm, this might be a non-problem: if you
remember that the upper level entries didn't cover your key when you
descended, you'd still know you need to recompute them.

Something else I just noticed is that WAL replay isn't capable of
completely fixing the index anyway:

* To complete insert we can't use basic insertion algorithm because
* during insertion we can't call user-defined support functions of opclass.
* So, we insert 'invalid' tuples without real key and do it by separate algorithm.
* 'invalid' tuple should be updated by vacuum full.

Given that there's no more vacuum full, and nobody has been expected to
run it routinely for a long time anyway, this fixup approach seems
pretty completely broken anyhow.

regards, tom lane

#12 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#11)
Re: B-tree parent pointer and checkpoints

On 11.11.2010 17:16, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

I think it'd be okay as far as that one entry is concerned, since as you
say it doesn't matter whether a search finds it. (We'd have to be sure
that VACUUM would still find it to remove it, of course, but that
doesn't use a normal search.) You're right that it poses a hazard of
subsequent inserts deciding that they don't need to do work on upper
levels because the lower ones look OK already. But depending on the
details of the search algorithm, this might be a non-problem: if you
remember that the upper level entries didn't cover your key when you
descended, you'd still know you need to recompute them.

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.

Something else I just noticed is that WAL replay isn't capable of
completely fixing the index anyway:

* To complete insert we can't use basic insertion algorithm because
* during insertion we can't call user-defined support functions of opclass.
* So, we insert 'invalid' tuples without real key and do it by separate algorithm.
* 'invalid' tuple should be updated by vacuum full.

Given that there's no more vacuum full, and nobody has been expected to
run it routinely for a long time anyway, this fixup approach seems
pretty completely broken anyhow.

The 'invalid' tuples don't affect correctness, but are a drag on
performance, so they are similar to incomplete b-tree splits. I suspect
the overhead of an invalid gist pointer is much bigger than the overhead
of an incomplete b-tree split, though. I agree we should get rid of
that, it's not comforting to get a stream of messages in the logs saying
you should run VACUUM FULL.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#12)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.

Oh, that's a really good idea, I think. But what about page splits?
I guess in the case of a split, you'd be replacing the parent entry
anyway, so having previously updated it to something larger doesn't
really cause a problem other than wasting a few cycles --- which are
probably still less than you save by not having to traverse back up.

If we supported UNIQUE GIST indexes then you could criticize this plan
on the grounds that parent entries would get uselessly enlarged before
detecting a uniqueness failure; but we don't and I know of no plans to.
So on the whole I think it sounds good. Teodor, what do you think?

regards, tom lane

#14 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#13)
Re: B-tree parent pointer and checkpoints

On 11.11.2010 20:34, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.

Oh, that's a really good idea, I think. But what about page splits?
I guess in the case of a split, you'd be replacing the parent entry
anyway, so having previously updated it to something larger doesn't
really cause a problem other than wasting a few cycles --- which are
probably still less than you save by not having to traverse back up.

I started looking at this, and ran into a problem with page splits. The
right-links in GiST work differently from b-tree: a right-link is only
followed if we detect a concurrent page split. A concurrent split is
detected by comparing the "NSN" field on the child page against the LSN
that we saw on the parent when walking down. This means that if you just
leave the incompletely split page in the tree, where one half is missing
the parent pointer, scans will not find any tuples on that page. They
would at first, but as soon as the parent page is updated due to some
unrelated insertion, the LSN of the parent is bumped above the NSN
stored on the child, and the page becomes invisible to scans.

We avoid that problem during normal operation by keeping the parent page
locked while the child is split, until the downlink is inserted into the
parent. That blocks any other modifications to the parent page that
would bump the LSN, until our downlink has been inserted. That doesn't
work after crash recovery, as all the locks are released.

I think we can work around that with a small modification to the page
split algorithm. In a nutshell, when the child page is split, put a flag
on the left half indicating that the rightlink must always be followed,
regardless of the NSN. When the downlink is inserted to the parent,
clear the flag. Setting and clearing of these flags need to be performed
during WAL replay as well.

So to split a page:

(0. Lock the page to be split)
1. Split the page. Mark the rightlink in the left half with a flag
indicating that it always needs to be followed.
2. Lock the parent.
3. Insert downlink. (The parent may need to be split too)
4. Clear the flag in the child, and update NSN to the LSN of the
downlink insertion record.
5. Release child.

6. If the parent was split in step 3, goto 2.

If we crash between steps 1 and 3, the rightlink will have the flag, so
scans will know to always follow it. If we crash after step 3, recovery
will replay steps 3 and 4, so scans will see the downlinks as usual.

After a crash, the tree can be fixed up later by vacuum or subsequent
inserts, by performing steps 2-4.
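The steps above can be sketched as a toy model (class, field, and function
names are invented here; this only illustrates the flag protocol, not the
actual GiST page layout):

```python
# The left half of a split carries a "follow right" flag until its downlink
# reaches the parent, so a scan never loses the right half even after a crash
# between the split and the downlink insertion.

class Page:
    def __init__(self, keys):
        self.keys = keys
        self.rightlink = None
        self.follow_right = False
        self.nsn = 0

def split(left):
    # Step 1: move half the keys to a new right page, flag the left half.
    mid = len(left.keys) // 2
    right = Page(left.keys[mid:])
    left.keys = left.keys[:mid]
    left.rightlink = right
    left.follow_right = True
    return right

def insert_downlink(left, new_parent_lsn):
    # Steps 3-4: the downlink reaches the parent; clear the flag and set the
    # NSN to the LSN of the downlink-insertion record.
    left.follow_right = False
    left.nsn = new_parent_lsn

def scan_must_follow_right(page, parent_lsn_seen):
    # Follow the rightlink if flagged (incomplete split survived a crash),
    # or on the ordinary concurrent-split NSN check.
    return page.follow_right or (page.rightlink is not None
                                 and parent_lsn_seen < page.nsn)

leaf = Page([1, 2, 3, 4])
split(leaf)
assert scan_must_follow_right(leaf, parent_lsn_seen=10)   # flag set: crash-safe
insert_downlink(leaf, new_parent_lsn=11)
assert not scan_must_follow_right(leaf, parent_lsn_seen=11)
```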

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15 Greg Stark
In reply to: Heikki Linnakangas (#14)
Re: B-tree parent pointer and checkpoints

On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think we can work around that with a small modification to the page split
algorithm. In a nutshell, when the child page is split, put a flag on the
left half indicating that the rightlink must always be followed, regardless
of the NSN. When the downlink is inserted to the parent, clear the flag.
Setting and clearing of these flags need to be performed during WAL replay
as well.

Does this not cause duplicate results? Or does GIST already have to be
prepared to deal with duplicate results?

--
greg

#16 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Stark (#15)
Re: B-tree parent pointer and checkpoints

On 13.11.2010 00:34, Greg Stark wrote:

On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think we can work around that with a small modification to the page split
algorithm. In a nutshell, when the child page is split, put a flag on the
left half indicating that the rightlink must always be followed, regardless
of the NSN. When the downlink is inserted to the parent, clear the flag.
Setting and clearing of these flags need to be performed during WAL replay
as well.

Does this not cause duplicate results? Or does GiST already have to be
prepared to deal with duplicate results?

The GiST search algorithm avoids duplicate results by remembering the
LSN on the parent page when it follows a downlink. The split currently
happens like this:

0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and
new buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original
downlink is updated to reflect only the keys that stayed on the left
page). While keeping the child pages locked, the NSN fields on the
children are updated with the new LSN of the parent page.

To avoid duplicates, when a scan looks at the child page, it needs to
know if it saw the parent page before or after the downlink was
inserted. If it saw it before, the scan needs to follow the rightlink to
the right half, otherwise it will follow the downlink as usual (if it
matched). The scan checks that by comparing the LSN it saw on the parent
page with the NSN on the child page. If parent LSN < NSN, we saw the
parent before the downlink was inserted.
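
That LSN/NSN check could be sketched like this (a toy model; the function and parameter names are made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Decide whether a scan arriving at a child page must also chase the
 * rightlink to the right sibling.  parent_lsn is the LSN the scan
 * remembered when it read the parent; child_nsn is the NSN stamped on
 * the child by its last split. */
static bool
must_follow_rightlink(Lsn parent_lsn, Lsn child_nsn)
{
    /* parent_lsn < child_nsn means our view of the parent predates the
     * downlink insertion, so the right half is reachable only via the
     * rightlink. */
    return parent_lsn < child_nsn;
}
```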

Now, the problem with crash recovery is that the above algorithm depends
on the split to keep the parent and child locked until the downlink is
inserted in the parent. If you crash between steps 2 and 3, the locks
are gone. If a later insert then updates the parent page, because of a
split on some unrelated child page, that will bump the LSN of the parent
above the NSN on the child. Scans will see that the parent LSN > child
NSN, and will no longer follow the rightlink.

And the fix for that is to set a flag on the child page indicating that
rightlink has to be always followed regardless of the LSN/NSN, because
the downlink hasn't been inserted yet. When the downlink is inserted,
the flag is cleared and we rely on the existing LSN/NSN mechanism to
avoid duplicate results.
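
Folding the proposed flag into the scan decision might look like this (again a sketch with invented names, not the actual GiST code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t PageLsn;

/* Scan decision with the proposed flag: while the downlink is still
 * missing (flag set), the rightlink must always be followed;
 * otherwise the existing LSN/NSN comparison applies. */
static bool
scan_follows_rightlink(bool follow_right, PageLsn parent_lsn,
                       PageLsn child_nsn)
{
    if (follow_right)
        return true;               /* incomplete split: no downlink yet */
    return parent_lsn < child_nsn; /* saw parent before downlink insert */
}
```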

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#16)
Re: B-tree parent pointer and checkpoints

Has this been addressed?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#18 Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#17)
Re: B-tree parent pointer and checkpoints

Bruce Momjian wrote:

Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c


#19 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#18)
Re: B-tree parent pointer and checkpoints

On 10.03.2011 22:50, Bruce Momjian wrote:

Bruce Momjian wrote:

Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c

We fixed GiST. B-tree still has the issue that if you have a checkpoint
in the middle of an insert, and crash, you might be left with a leaf
node without a downlink in the parent.

That's relatively harmless, index searches and insertions work without
the downlink. However, there's code in page deletion that ERRORs if the
parent can't be found. That should be fixed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#19)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 10.03.2011 22:50, Bruce Momjian wrote:

Bruce Momjian wrote:

Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c

We fixed GiST. B-tree still has the issue that if you have a checkpoint
in the middle of an insert, and crash, you might be left with a leaf
node without a downlink in the parent.

But that will be fixed during WAL replay.

That's relatively harmless, index searches and insertions work without
the downlink. However, there's code in page deletion that ERRORs if the
parent can't be found. That should be fixed.

I don't like the idea of removing that consistency check, and I don't
think it's necessary to do so.

regards, tom lane

#21 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#20)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#21)
#23 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#22)
#24 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#23)
#25 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#24)
#26 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#25)
#27 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#26)
#28 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#27)
#29 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#28)
#30 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#29)
#31 Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#23)
#32 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#28)
#33 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#32)
#34 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#33)