Grouped Index Tuples / Clustered Indexes

Started by Heikki Linnakangas, over 19 years ago, 13 messages, hackers

heikki.linnakangas@enterprisedb.com

over 19 years ago

I've updated the GIT patch at http://community.enterprisedb.com/git/.
Bitrot caused by the findinsertloc-patch has been fixed, making that
part of the GIT patch a little bit smaller and cleaner. I also did some
refactoring, and minor cleanup and commenting.

Any comments on the design or patch? For your convenience, I copied the
same text I added to access/nbtree/README to
http://community.enterprisedb.com/git/git-readme.txt

Should we start playing the name game at this point? I've been thinking
we should call this feature just Clustered Indexes, even though it's not
exactly the same thing as clustered indexes in other DBMSs. From the user's
point of view, they behave similarly enough that it may be best to use
the existing term.

As a next step, I'm hoping to get the indexam API changes from the
bitmap index patch committed soon, and in a way that supports GIT as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Grzegorz Jaskiewicz

gj@pointblue.com.pl

over 19 years ago

In reply to: Heikki Linnakangas (#1)

Re: Grouped Index Tuples / Clustered Indexes

My only question would be:
why isn't that in core already?

Luke Lonergan

llonergan@greenplum.com

over 19 years ago

In reply to: Grzegorz Jaskiewicz (#2)

Re: Grouped Index Tuples / Clustered Indexes

On 3/7/07 6:53 AM, "Grzegorz Jaskiewicz" <gj@pointblue.com.pl> wrote:

My only question would be:
why isn't that in core already?

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Simon Riggs

simon@2ndQuadrant.com

over 19 years ago

In reply to: Heikki Linnakangas (#1)

Re: Grouped Index Tuples / Clustered Indexes

On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:

I've been thinking
we should call this feature just Clustered Indexes

Works for me.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Simon Riggs (#4)

Re: Grouped Index Tuples / Clustered Indexes

On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:

I've been thinking
we should call this feature just Clustered Indexes

So we would have "clustered tables" which are tables whose heap is ordered
according to an index and separately "clustered indexes" which are indexes
optimized for such tables?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 19 years ago

In reply to: Bruce Momjian (#5)

Re: Grouped Index Tuples / Clustered Indexes

Gregory Stark wrote:

On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:

I've been thinking
we should call this feature just Clustered Indexes

So we would have "clustered tables" which are tables whose heap is ordered
according to an index and separately "clustered indexes" which are indexes
optimized for such tables?

Yes, that's what I was thinking.

There's a third related term in use as well. When you issue CLUSTER, the
table will be clustered on an index. And that index is then the "index
the table is clustered on". That's a bit cumbersome but that's the
terminology we're using at the moment. Maybe we should come up with a
new term for that to avoid confusion.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Florian Pflug

fgp@phlo.org

over 19 years ago

In reply to: Heikki Linnakangas (#6)

Re: Grouped Index Tuples / Clustered Indexes

Heikki Linnakangas wrote:

There's a third related term in use as well. When you issue CLUSTER, the
table will be clustered on an index. And that index is then the "index
the table is clustered on". That's a bit cumbersome but that's the
terminology we're using at the moment. Maybe we should come up with a
new term for that to avoid confusion.

This reminds me of something I've been wondering about for quite some
time. Why is it that one has to write "cluster <index> on <table>",
and not "cluster <table> on <index>"?

To me, the second variant would seem more logical, but then I'm
not a native English speaker...

I'm not suggesting that this should be changed, I'm just wondering
why it is the way it is.

greetings, Florian Pflug

Simon Riggs

simon@2ndQuadrant.com

over 19 years ago

In reply to: Heikki Linnakangas (#6)

Re: Grouped Index Tuples / Clustered Indexes

On Sun, 2007-03-11 at 11:22 +0000, Heikki Linnakangas wrote:

Gregory Stark wrote:

On Wed, 2007-03-07 at 10:32 +0000, Heikki Linnakangas wrote:

I've been thinking
we should call this feature just Clustered Indexes

So we would have "clustered tables" which are tables whose heap is ordered
according to an index and separately "clustered indexes" which are indexes
optimized for such tables?

Yes, that's what I was thinking.

There's a third related term in use as well. When you issue CLUSTER, the
table will be clustered on an index. And that index is then the "index
the table is clustered on". That's a bit cumbersome but that's the
terminology we're using at the moment. Maybe we should come up with a
new term for that to avoid confusion.

First thought: we can use the term "cluster*ing* index" for CLUSTER and
use the term "clustered" to refer to what has happened to the table and
the index. That will probably be confused with high availability
clustering, so perhaps not.

Better thought: say that CLUSTER requires an "order-defining index".
That better explains the point that it is the table being clustered,
using the index to define the physical order of the rows in the heap. We
then use the word "clustered" to refer to what has happened to the
table, and with this patch, for the index also.

That way we can have new syntax for CLUSTER

CLUSTER table ORDER BY indexname

which is then the preferred syntax, rather than the perverse

CLUSTER index ON table

which gives the wrong impression about what is happening, since it is
the table that is changed, not the index.

- - -

- Are you suggesting that we have an explicit new syntax

CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...

or just that we refer to this feature as Clustered Indexes?

- Do we still need the index WITH option, in either case?

- Do you think that all Primary Keys should be clustered?

- Are you thinking to rename docs, catalog etc to reflect the new
naming/meaning?

My thinking would be: CLUSTERED, no, yes, yes
but I'd like to know what you think?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Simon Riggs

simon@2ndQuadrant.com

over 19 years ago

In reply to: Florian Pflug (#7)

Re: Grouped Index Tuples / Clustered Indexes

On Sun, 2007-03-11 at 19:06 +0100, Florian G. Pflug wrote:

Heikki Linnakangas wrote:

There's a third related term in use as well. When you issue CLUSTER, the
table will be clustered on an index. And that index is then the "index
the table is clustered on". That's a bit cumbersome but that's the
terminology we're using at the moment. Maybe we should come up with a
new term for that to avoid confusion.

This reminds me of something I've been wondering about for quite some
time. Why is it that one has to write "cluster <index> on <table>",
and not "cluster <table> on <index>"?

To me, the second variant would seem more logical, but then I'm
not a native English speaker...

I'm not suggesting that this should be changed, I'm just wondering
why it is the way it is.

No idea, but I agree it conveys exactly the opposite view of what
happens when the command is issued.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#10

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 19 years ago

In reply to: Simon Riggs (#8)

Re: Grouped Index Tuples / Clustered Indexes

Simon Riggs wrote:

Better thought: say that CLUSTER requires an "order-defining index".
That better explains the point that it is the table being clustered,
using the index to define the physical order of the rows in the heap. We
then use the word "clustered" to refer to what has happened to the
table, and with this patch, for the index also.

That way we can have new syntax for CLUSTER

CLUSTER table ORDER BY indexname

which is then the preferred syntax, rather than the perverse

CLUSTER index ON table

which gives the wrong impression about what is happening, since it is
the table that is changed, not the index.

I like that, "order-defining index" conveys the point pretty well.

- Are you suggesting that we have an explicit new syntax

CREATE [UNIQUE] CLUSTERED INDEX [CONCURRENTLY] fooidx ON foo (....) ...

or just that we refer to this feature as Clustered Indexes?

I'm not proposing new syntax, just a WITH parameter. That makes more sense
to me: the clusteredness has no user-visible effects except
performance, and it's b-tree specific (though I guess you could apply
the same concept to other indexams as well).

- Do you think that all Primary Keys should be clustered?

No. There's a significant CPU overhead when the index and table are in
memory and you're doing simple one-row lookups. And there's no promise
that a table is physically in primary key order anyway.

There might be some interesting cases where we could enable it
automatically. I've been thinking that if you explicitly CLUSTER a
table, the order-defining index would definitely benefit from being a
clustered index. If it's small enough that it fits in memory, there's no
point in running CLUSTER in the first place. And if you run CLUSTER, we
know it's in order. That seems like a pretty safe bet.
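
A rough sketch of why heap order matters here, under a deliberately simplified model that is an assumption and not the patch's actual on-disk format: a plain index keeps one entry per heap tuple, while a grouped index keeps one entry per run of index-adjacent keys that live on the same heap page. The names `build_plain_index`, `build_grouped_index` and `lookup` are hypothetical. The final scan of the heap page in `lookup` stands in for the extra CPU work per one-row lookup mentioned above.

```python
from bisect import bisect_right
from itertools import groupby


def build_plain_index(heap):
    """One entry per heap tuple: (key, page_no), sorted by key."""
    return sorted(
        (key, page_no) for page_no, page in enumerate(heap) for key in page
    )


def build_grouped_index(heap):
    """Collapse each run of index-adjacent keys on the same heap page
    into a single entry (lowest key of the run, page_no)."""
    grouped = []
    for page_no, run in groupby(build_plain_index(heap), key=lambda e: e[1]):
        first_key, _ = next(run)
        grouped.append((first_key, page_no))
    return grouped


def lookup(grouped, heap, key):
    """Find the group that could cover `key`, then scan that heap page.
    Returns the page number holding the key, or None if it is absent."""
    keys = [k for k, _ in grouped]
    pos = bisect_right(keys, key) - 1
    if pos < 0:
        return None
    page_no = grouped[pos][1]
    # Extra work compared to a plain index: search within the heap page.
    return page_no if key in heap[page_no] else None


# Three heap pages of four tuples each (toy data, assumed for illustration).
clustered = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
scattered = [[1, 5, 9, 2], [6, 10, 3, 7], [11, 4, 8, 12]]

if __name__ == "__main__":
    print(len(build_plain_index(clustered)))    # entries in a plain index
    print(len(build_grouped_index(clustered)))  # entries when heap is in order
    print(len(build_grouped_index(scattered)))  # entries when heap is not
```

In this toy model the grouped index shrinks to one entry per heap page only when the heap is in key order; on the scattered heap it stays close to the size of the plain index, which matches the point that the benefit depends on the table actually being clustered.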

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Simon Riggs (#9)

Re: Grouped Index Tuples / Clustered Indexes

Simon Riggs wrote:

On Sun, 2007-03-11 at 19:06 +0100, Florian G. Pflug wrote:

Heikki Linnakangas wrote:

There's a third related term in use as well. When you issue CLUSTER, the
table will be clustered on an index. And that index is then the "index
the table is clustered on". That's a bit cumbersome but that's the
terminology we're using at the moment. Maybe we should come up with a
new term for that to avoid confusion.

This reminds me of something I've been wondering about for quite some
time. Why is it that one has to write "cluster <index> on <table>",
and not "cluster <table> on <index>"?

To me, the second variant would seem more logical, but then I'm
not a native English speaker...

I'm not suggesting that this should be changed, I'm just wondering
why it is the way it is.

No idea, but I agree it conveys exactly the opposite view of what
happens when the command is issued.

We got the syntax from Berkeley, and it has always seemed backwards to me
too.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#12

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Simon Riggs (#8)

Re: Grouped Index Tuples / Clustered Indexes

Added to TODO:

o Add more logical syntax CLUSTER table ORDER BY index;
support current syntax for backward compatibility

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#13

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Heikki Linnakangas (#1)

Re: Grouped Index Tuples / Clustered Indexes

Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +