Documentation of bt_page_items()'s ctid field

Started by Peter Geogheganabout 11 years ago12 messages
#1Peter Geoghegan
pg@heroku.com
1 attachment(s)

Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.

--
Peter Geoghegan

Attachments:

bt_page_items_ctid.patchtext/x-patch; charset=US-ASCII; name=bt_page_items_ctid.patchDownload
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 6f51dc6..433bcb2 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -192,6 +192,13 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
           7 | (0,7)   |      12 | f     | f    | 29 27 00 00
           8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
 </screen>
+      Note that <structfield>ctid</> addresses a heap tuple if the
+      page under consideration is a B-Tree leaf page.  Otherwise, for
+      internal B-Tree pages <structfield>ctid</> addresses a page in
+      the B-Tree itself (excluding the root page if and only if there
+      has not yet been a root page split, as in the example above).
+      These internally referenced pages are child pages, and may
+      themselves be leaf pages or internal pages.
      </para>
     </listitem>
    </varlistentry>
#2Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Peter Geoghegan (#1)
Re: Documentation of bt_page_items()'s ctid field

On 12/30/2014 04:08 AM, Peter Geoghegan wrote:

Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.

How much detail on the b-tree internals do we want to put in the
pageinspect documentation? I can see that being useful, but should we
also explain e.g. that the first item on each (non-rightmost) page is
the high key?

+      Note that <structfield>ctid</> addresses a heap tuple if the
+      page under consideration is a B-Tree leaf page.  Otherwise, for
+      internal B-Tree pages <structfield>ctid</> addresses a page in
+      the B-Tree itself (excluding the root page if and only if there
+      has not yet been a root page split, as in the example above).
+      These internally referenced pages are child pages, and may
+      themselves be leaf pages or internal pages.

I had a hard time understanding the remark about the root page. But in
any case, if you look at the flags set e.g. with bt_page_stats(), the
root page is flagged as also being a leaf page, when it is the only page
in the index. So the root page is considered also a leaf page in that case.

I'd suggest saying the same thing (or more) with fewer words:

In a B-tree leaf page, <structfield>ctid</> points to a heap tuple. In
an internal page, it points to another page in the index itself, and the
offset number part (the second number) of the ctid field is ignored.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#2)
1 attachment(s)
Re: Documentation of bt_page_items()'s ctid field

On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

How much detail on the b-tree internals do we want to put in the pageinspect
documentation? I can see that being useful, but should we also explain e.g.
that the first item on each (non-rightmost) page is the high key?

Maybe we should. I see no reason not to, and I think that it makes
sense to explain things at that level without going into flags and so
on. But don't forget that that isn't quite the full story if we're
going to talk about high keys at all; we must also explain "minus
infinity" keys, alongside any explanation of the high key:

* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
* scankey. The actual key value stored (if any, which there probably isn't)
* does not matter. This convention allows us to implement the Lehman and
* Yao convention that the first down-link pointer is before the first key.
* See backend/access/nbtree/README for details.

In particular, this means that the key data is garbage, which is
something I've also seen causing confusion [1]/messages/by-id/20140828.110824.1195843073079055852.t-ishii@sraoss.co.jp -- Peter Geoghegan.

I would like to make it easier for competent non-experts on the B-Tree
code to eyeball a B-Tree with pageinspect, and be reasonably confident
that things add up. In order for such people to know that something is
wrong, we should explain what "right" looks like in moderate detail.
So, as I said, I feel an exact explanation of flags is unnecessary,
but tend to agree that a brief reference to both page highkeys and
"minus infinity" keys is appropriate, since users of the function will
see them all the time.

I had a hard time understanding the remark about the root page. But in any
case, if you look at the flags set e.g. with bt_page_stats(), the root page
is flagged as also being a leaf page, when it is the only page in the index.
So the root page is considered also a leaf page in that case.

I think that a better way of handling that originally would have been
to make root-ness a separate property from leaf-ness/internal-ness.
Too late for that now, I suppose.

I'd suggest saying the same thing (or more) with fewer words:

In a B-tree leaf page, <structfield>ctid</> points to a heap tuple. In an
internal page, it points to another page in the index itself, and the offset
number part (the second number) of the ctid field is ignored.

That seems good. What do you think of the attached revision?

[1]: /messages/by-id/20140828.110824.1195843073079055852.t-ishii@sraoss.co.jp -- Peter Geoghegan
--
Peter Geoghegan

Attachments:

bt_page_items_ctid.patchtext/x-patch; charset=US-ASCII; name=bt_page_items_ctid.patchDownload
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 6f51dc6..ffc620a 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -192,6 +192,16 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
           7 | (0,7)   |      12 | f     | f    | 29 27 00 00
           8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
 </screen>
+      In a B-tree leaf page, <structfield>ctid</> points to a heap
+      tuple.  In an internal page, it points to another page in the
+      index itself, and the offset number part (the second number) of
+      the <structfield>ctid</> field is ignored.  Note that the first
+      item on any non-rightmost page is the page <quote>high
+      key</quote>, meaning its <structfield>data</> serves as an upper
+      bound on all items appearing on the page.  Also, on non-leaf
+      pages, the first "data item" (the first item that is not a high
+      key) is a <quote>minus infinity</quote> item, implying that the
+      value of <structfield>data</> is garbage.
      </para>
     </listitem>
    </varlistentry>
#4Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Peter Geoghegan (#3)
Re: Documentation of bt_page_items()'s ctid field

On 12/30/2014 10:07 PM, Peter Geoghegan wrote:

On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

How much detail on the b-tree internals do we want to put in the pageinspect
documentation? I can see that being useful, but should we also explain e.g.
that the first item on each (non-rightmost) page is the high key?

Maybe we should. I see no reason not to, and I think that it makes
sense to explain things at that level without going into flags and so
on. But don't forget that that isn't quite the full story if we're
going to talk about high keys at all; we must also explain "minus
infinity" keys, alongside any explanation of the high key:

Yeah, good point.

* CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
* "minus infinity": this routine will always claim it is less than the
* scankey. The actual key value stored (if any, which there probably isn't)
* does not matter. This convention allows us to implement the Lehman and
* Yao convention that the first down-link pointer is before the first key.
* See backend/access/nbtree/README for details.

In particular, this means that the key data is garbage, which is
something I've also seen causing confusion [1].

In practice, we never store any actual key value for the "minus
infinity" key. I guess the code would ignore it if it was there, but it
would make more sense to explain that the first data key on an internal
page does not have a key value. If there is a value there, it's a sign
that something's wrong.

I would like to make it easier for competent non-experts on the B-Tree
code to eyeball a B-Tree with pageinspect, and be reasonably confident
that things add up. In order for such people to know that something is
wrong, we should explain what "right" looks like in moderate detail.

Makes sense.

I had a hard time understanding the remark about the root page. But in any
case, if you look at the flags set e.g. with bt_page_stats(), the root page
is flagged as also being a leaf page, when it is the only page in the index.
So the root page is considered also a leaf page in that case.

I think that a better way of handling that originally would have been
to make root-ness a separate property from leaf-ness/internal-ness.

Hmm, yeah, bt_page_stats() currently returns 'l' in the type column when
(BTP_ROOT | BTP_LEAF).

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#4)
Re: Documentation of bt_page_items()'s ctid field

On Tue, Dec 30, 2014 at 12:21 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

In practice, we never store any actual key value for the "minus infinity"
key. I guess the code would ignore it if it was there, but it would make
more sense to explain that the first data key on an internal page does not
have a key value. If there is a value there, it's a sign that something's
wrong.

Well, depends on what you're exact definition of a key is, I suppose.
Here is what I see for a B-Tree's root page (it's a few hundred KiBs
in size):

postgres=# select * from bt_page_items('ix_order_custid', 3);
itemoffset | ctid | itemlen | nulls | vars | data
------------+--------+---------+-------+------+-------------------------
1 | (1,1) | 8 | f | f |
2 | (2,1) | 16 | f | f | 65 02 00 00 00 00 00 00
3 | (4,1) | 16 | f | f | d2 04 00 00 00 00 00 00
4 | (5,1) | 16 | f | f | 28 07 00 00 00 00 00 00
5 | (6,1) | 16 | f | f | b9 09 00 00 00 00 00 00

*** SNIP ***

It's storing an index tuple, but not a key value itself. I guess that
the equivocating comments (that I pasted here before, that appear over
_bt_compare()) about what is stored made me use the term garbage, but
off hand I'm not aware of how that could actually happen either.
Perhaps it's an obsolete comment, and the "data" field will always be
empty for "minus infinity" items?

I'll leave the exact description of this state of the "data" field to you.
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: Documentation of bt_page_items()'s ctid field

On Tue, Dec 30, 2014 at 8:59 AM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

On 12/30/2014 04:08 AM, Peter Geoghegan wrote:

Attached documentation patch describes the purpose of
bt_page_items()'s ctid field. This has come up enough times in
disaster recovery or testing scenarios that I feel it's worth drawing
particular attention to.

How much detail on the b-tree internals do we want to put in the
pageinspect documentation? I can see that being useful, but should we also
explain e.g. that the first item on each (non-rightmost) page is the high
key?

How do I know if I am looking at a non-rightmost page?

Thanks,

Jeff

#7Peter Geoghegan
pg@heroku.com
In reply to: Jeff Janes (#6)
Re: Documentation of bt_page_items()'s ctid field

On Mon, Mar 9, 2015 at 3:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

How do I know if I am looking at a non-rightmost page?

It has a right-link (that's the easiest way to tell). It will (as the
docpatch points out) also necessarily have a ""high key" item. We
know that we have to move right if the high key doesn't bound the
value we expected to find on the page (our scankey item - what the
index scan is searching for). So the high key goes with having a
rightlink, and in general we detect that we're looking at a
non-rightmost page based on the presence of a right-link.

The confusing thing is that some pages have a high-key *and* a "minus
infinity" item. The latter has "data" that is undefined (cannot be
interpreted as a value of any type). It is still logically a "real"
item, that represents minus infinity for the purposes of binary
searches. But the high key is not a "real" item, even though its
"data" *is* a well-defined value of the underlying IndexTuple type or
types. Typically, the high key "data" matches the first non-highkey
"real" item on the next/right page...not including the "minus
infinity" item, if any.

Is that any clearer? Basically, high key items are relevant to all
non-rightmost pages on all levels. OTOH, "minus infinity" items are
only relevant to internal (non-leaf) pages (and a pre-first-split root
page is considered a leaf page for this purpose). Internal,
non-rightmost pages (internal pages are typically only a small
minority of the total) have both.
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Geoghegan (#7)
Re: Documentation of bt_page_items()'s ctid field

On Mon, Mar 9, 2015 at 4:06 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Mar 9, 2015 at 3:51 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

How do I know if I am looking at a non-rightmost page?

It has a right-link (that's the easiest way to tell).

Meaning that btpo_next is not zero? Should we say that in the patch in so
many words? I think it will be hard to explain the page_items more without
also explaining the page_stats more.

It will (as the
docpatch points out) also necessarily have a ""high key" item. We
know that we have to move right if the high key doesn't bound the
value we expected to find on the page (our scankey item - what the
index scan is searching for). So the high key goes with having a
rightlink, and in general we detect that we're looking at a
non-rightmost page based on the presence of a right-link.

So if I understand this correctly, if there is a high key it is itemoffset
1, but to know whether itemoffset 1 is a high key I first have to look at
btpo_next.

And if there is a minus infinity, it will either be itemoffset 2 or 1,
depending on whether there is a high key or not.

Thanks,

Jeff

#9Peter Geoghegan
pg@heroku.com
In reply to: Jeff Janes (#8)
1 attachment(s)
Re: Documentation of bt_page_items()'s ctid field

On Mon, Mar 9, 2015 at 5:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

It has a right-link (that's the easiest way to tell).

Meaning that btpo_next is not zero? Should we say that in the patch in so
many words? I think it will be hard to explain the page_items more without
also explaining the page_stats more.

Yes. And my wording above was sloppy: By definition, a non-rightmost
page is a page that has a rightlink (it will also have a highkey, but
that's certainly not how we distinguish them).

Attached patch adds a definition of non-rightmost in parenthesis,
which I think helps.

So if I understand this correctly, if there is a high key it is itemoffset
1, but to know whether itemoffset 1 is a high key I first have to look at
btpo_next.

Yes. The logic for caring about "minus infinity" items within internal
pages is hardcoded into _bt_compare(), so it's kind of a low-level
detail, even by these standards...but it comes up with pageinspect,
since they consume an IndexTuple. Confusion ensues.

And if there is a minus infinity, it will either be itemoffset 2 or 1,
depending on whether there is a high key or not.

That's correct. So "data" can be "(has data), (has no data), (has
data), (has data) .... " on a given page, which could freak users out.

--
Peter Geoghegan

Attachments:

bt_page_items_rightmost.patchtext/x-patch; charset=US-ASCII; name=bt_page_items_rightmost.patchDownload
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 6f51dc6..502903c 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -192,6 +192,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
           7 | (0,7)   |      12 | f     | f    | 29 27 00 00
           8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
 </screen>
+      In a B-tree leaf page, <structfield>ctid</> points to a heap
+      tuple.  In an internal page, it points to another page in the
+      index itself, and the offset number part (the second number) of
+      the <structfield>ctid</> field is ignored.  Note that the first
+      item on any non-rightmost page (any page lacking a
+      <structfield>btpo_next</> field) is the page <quote>high
+      key</quote>, meaning its <structfield>data</> serves as an upper
+      bound on all items appearing on the page.  Also, on non-leaf
+      pages, the first "data item" (the first item that is not a high
+      key) is a <quote>minus infinity</quote> item, implying that its
+      <structfield>data</> field lacks any value.
      </para>
     </listitem>
    </varlistentry>
#10Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Geoghegan (#9)
1 attachment(s)
Re: Documentation of bt_page_items()'s ctid field

On Mon, Mar 9, 2015 at 5:34 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Mar 9, 2015 at 5:18 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

It has a right-link (that's the easiest way to tell).

Meaning that btpo_next is not zero? Should we say that in the patch in

so

many words? I think it will be hard to explain the page_items more

without

also explaining the page_stats more.

Yes. And my wording above was sloppy: By definition, a non-rightmost
page is a page that has a rightlink (it will also have a highkey, but
that's certainly not how we distinguish them).

Attached patch adds a definition of non-rightmost in parenthesis,
which I think helps.

Thanks. I think the sense of the definition is wrong (you are defining a
rightmost page, not a non-rightmost page), and the btpo_next field is zero,
not null, when on a rightmost page. Or at least, bt_page_stats presents it
as zero.

So more like the attached patch.

I think we do want this. It was asked how much we want to include
internals of the btree code into pageinspect documentation, but I think the
nature of pageinspect makes that unavoidable. Its docs have to tell you
what it does, and inspecting the internals is what it does. And since
those internals are unlikely to change other than in a way that breaks
pg_upgrade, I think we can safely assume that the work of updating the doc
would pale compared to the work of making the docs inaccurate.

I'd mark it ready for committer, but since I also attached a suggested
replacement patch it seems presumptuous to do that.

Cheers,

Jeff

Attachments:

bt_page_items_rightmost_jj.patchapplication/octet-stream; name=bt_page_items_rightmost_jj.patchDownload
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 6f51dc6..502903c 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -192,6 +192,17 @@ test=# SELECT * FROM bt_page_items('pg_cast_oid_index', 1);
           7 | (0,7)   |      12 | f     | f    | 29 27 00 00
           8 | (0,8)   |      12 | f     | f    | 2a 27 00 00
 </screen>
+      In a B-tree leaf page, <structfield>ctid</> points to a heap
+      tuple.  In an internal page, it points to another page in the
+      index itself, and the offset number part (the second number) of
+      the <structfield>ctid</> field is ignored.  Note that the first
+      item on any non-rightmost page (any page with a non-zero value in the
+      <structfield>btpo_next</> field) is the page <quote>high
+      key</quote>, meaning its <structfield>data</> serves as an upper
+      bound on all items appearing on the page.  Also, on non-leaf
+      pages, the first "data item" (the first item that is not a high
+      key) is a <quote>minus infinity</quote> item, implying that its
+      <structfield>data</> field lacks any value.
      </para>
     </listitem>
    </varlistentry>
#11Peter Geoghegan
pg@heroku.com
In reply to: Jeff Janes (#10)
Re: Documentation of bt_page_items()'s ctid field

On Wed, Mar 11, 2015 at 3:14 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

I'd mark it ready for committer, but since I also attached a suggested
replacement patch it seems presumptuous to do that.

I've marked it "ready for committer". I think your version is slightly better.

Thanks
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#10)
Re: Documentation of bt_page_items()'s ctid field

Jeff Janes <jeff.janes@gmail.com> writes:

I think we do want this. It was asked how much we want to include
internals of the btree code into pageinspect documentation, but I think the
nature of pageinspect makes that unavoidable.

Yeah. Committed with a little bit of additional wordsmithing.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers