SP-GiST confusing introductory paragraph

Started by PG Bug reporting formover 2 years ago3 messagesdocs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/spgist-intro.html
Description:

I'm confused by this paragraph:

These popular data structures were originally developed for in-memory

usage. In main memory, they are usually designed as a set of dynamically
allocated nodes linked by pointers. This is not suitable for direct storing
on disk, since these chains of pointers can be rather long which would
require too many disk accesses. In contrast, disk-based data structures
should have a high fanout to minimize I/O. The challenge addressed by
SP-GiST is to map search tree nodes to disk pages in such a way that a
search need access only a few disk pages, even if it traverses many nodes.

In particular, "These popular datastructures" is ambiguous -- based on how
the previous paragraph ends, it sounds like the "popular datastructures" or
SP-GiSTs, but then it goes on to say they were designed for in-memory, and
then it mentions that SP-GiST's space partitioning (with high fanout) is
more appropriate to minimize disk access. I think maybe the solution here
would be to replace "These popular data structures" with something like
"Classic Postgres indexes such as B-tree indexes..." or something like
that.

Also, I think a short parenthetical definition of "fanout" would be useful
here, something like "high fanout (i.e. where each node has a potentially
large number of child nodes that reside in the same disk page)". Took me
some googling to realize what fanout meant in this context.

Best,
Alex

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: PG Bug reporting form (#1)
Re: SP-GiST confusing introductory paragraph

On 2023-Oct-16, PG Doc comments form wrote:

I'm confused by this paragraph:

These popular data structures were originally developed for in-memory
usage. In main memory, they are usually designed as a set of dynamically
allocated nodes linked by pointers. This is not suitable for direct storing
on disk, since these chains of pointers can be rather long which would
require too many disk accesses. In contrast, disk-based data structures
should have a high fanout to minimize I/O. The challenge addressed by
SP-GiST is to map search tree nodes to disk pages in such a way that a
search need access only a few disk pages, even if it traverses many nodes.

In particular, "These popular datastructures" is ambiguous -- based on how
the previous paragraph ends, it sounds like the "popular datastructures" or
SP-GiSTs, but then it goes on to say they were designed for in-memory, and
then it mentions that SP-GiST's space partitioning (with high fanout) is
more appropriate to minimize disk access. I think maybe the solution here
would be to replace "These popular data structures" with something like
"Classic Postgres indexes such as B-tree indexes..." or something like
that.

Yeah, this paragraph is a rewording of Oleg's[1]http://www.sai.msu.su/~megera/wiki/spgist_dev

SP-GiST is an abbreviation of space-partitioned GiST - the search
tree, which allows to implement a wide range of different
non-balanced disk-based data structures, such as quadtree, kd-tree,
tries - popular data structures, originally developed for memory
storage. Main memory access structures usually designed as a set of
dynamically allocated nodes linked by pointers, which is not suitable
for direct storing on disk, since these chains of pointers can be
rather long and require too many disk accesses. In opposite, disk
based data structures have a high fanout to minimize I/O. The
challenge is to map nodes of tree to disk pages in such a way, so
search algorithm accesses only a few disk pages, even if it traverse
many nodes.

where the point is (IMO) much clearer.

[1]: http://www.sai.msu.su/~megera/wiki/spgist_dev

Also, I think a short parenthetical definition of "fanout" would be useful
here, something like "high fanout (i.e. where each node has a potentially
large number of child nodes that reside in the same disk page)". Took me
some googling to realize what fanout meant in this context.

Hmm. We also use the term (hypenated as fan-out) in the reference to
recursive_worktable_factor. Maybe we should list it in the glossary in
a way that works for both.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)

#3Alex Matchneer
amatchneer@gmail.com
In reply to: Alvaro Herrera (#2)
Re: SP-GiST confusing introductory paragraph

I realized later that "these data structures" was referring to
"non-balanced data structures, such as quad-trees, k-d trees, and radix
trees (tries)" from the previous paragraph, and not SP-GiST, which isn't a
data structure per se. I feel a bit silly about this confusion and am not
sure others would have the same confusion I did.

Alex

On Mon, Oct 16, 2023 at 11:09 PM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

Show quoted text

On 2023-Oct-16, PG Doc comments form wrote:

I'm confused by this paragraph:

These popular data structures were originally developed for in-memory
usage. In main memory, they are usually designed as a set of

dynamically

allocated nodes linked by pointers. This is not suitable for direct

storing

on disk, since these chains of pointers can be rather long which would
require too many disk accesses. In contrast, disk-based data structures
should have a high fanout to minimize I/O. The challenge addressed by
SP-GiST is to map search tree nodes to disk pages in such a way that a
search need access only a few disk pages, even if it traverses many

nodes.

In particular, "These popular datastructures" is ambiguous -- based on

how

the previous paragraph ends, it sounds like the "popular datastructures"

or

SP-GiSTs, but then it goes on to say they were designed for in-memory,

and

then it mentions that SP-GiST's space partitioning (with high fanout) is
more appropriate to minimize disk access. I think maybe the solution here
would be to replace "These popular data structures" with something like
"Classic Postgres indexes such as B-tree indexes..." or something like
that.

Yeah, this paragraph is a rewording of Oleg's[1]

SP-GiST is an abbreviation of space-partitioned GiST - the search
tree, which allows to implement a wide range of different
non-balanced disk-based data structures, such as quadtree, kd-tree,
tries - popular data structures, originally developed for memory
storage. Main memory access structures usually designed as a set of
dynamically allocated nodes linked by pointers, which is not suitable
for direct storing on disk, since these chains of pointers can be
rather long and require too many disk accesses. In opposite, disk
based data structures have a high fanout to minimize I/O. The
challenge is to map nodes of tree to disk pages in such a way, so
search algorithm accesses only a few disk pages, even if it traverse
many nodes.

where the point is (IMO) much clearer.

[1] http://www.sai.msu.su/~megera/wiki/spgist_dev

Also, I think a short parenthetical definition of "fanout" would be

useful

here, something like "high fanout (i.e. where each node has a potentially
large number of child nodes that reside in the same disk page)". Took me
some googling to realize what fanout meant in this context.

Hmm. We also use the term (hypenated as fan-out) in the reference to
recursive_worktable_factor. Maybe we should list it in the glossary in
a way that works for both.

--
Álvaro Herrera PostgreSQL Developer —
https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el
sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)