Maintaining cluster order on insert
While thinking about index-organized-tables and similar ideas, it
occurred to me that there's some low-hanging-fruit: maintaining cluster
order on inserts by trying to place new heap tuples close to other
similar tuples. That involves asking the index am where on the heap the
new tuple should go, and trying to insert it there before using the FSM.
Using the new fillfactor parameter makes it more likely that there's
room on the page. We don't worry about the order within the page.
The API I'm thinking of introduces a new optional index am function,
amsuggestblock (suggestions for a better name are welcome). It gets the
same parameters as aminsert, and returns the heap block number that
would be the optimal place to put the new tuple. It would be called from
ExecInsert before inserting the heap tuple, and the suggestion is passed
on to heap_insert and RelationGetBufferForTuple.
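As a toy illustration of the idea (simplified names and types, not the patch's actual code), a sorted array can stand in for the btree leaf level: find the insertion point for the new key and suggest the heap block of its nearest neighbour, so the new heap tuple lands near similar tuples:

```c
#include <stddef.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Toy stand-in for a btree leaf level: index keys sorted ascending,
 * each remembering the heap block its tuple lives on. */
typedef struct
{
    int         key;
    BlockNumber heapblk;
} IndexEntry;

/* Simplified amsuggestblock: binary-search for the insertion point of
 * the new key and return the heap block of its left neighbour (or the
 * right one at the very start), i.e. where similar tuples already are. */
static BlockNumber
suggest_block(const IndexEntry *entries, size_t n, int newkey)
{
    size_t lo = 0, hi = n;

    if (n == 0)
        return InvalidBlockNumber;  /* empty index: no suggestion */

    while (lo < hi)
    {
        size_t mid = (lo + hi) / 2;

        if (entries[mid].key < newkey)
            lo = mid + 1;
        else
            hi = mid;
    }
    return entries[lo > 0 ? lo - 1 : 0].heapblk;
}
```

The suggestion is only a hint: if the page turns out to be full, the insert falls back to the FSM as before.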
I wrote a little patch to implement this for btree, attached.
This could be optimized by changing the existing aminsert API, because
as it is, an insert will have to descend the btree twice. Once in
amsuggestblock and then in aminsert. amsuggestblock could keep the right
index page pinned so aminsert could locate it quicker. But I wanted to
keep this simple for now. Another improvement might be to allow
amsuggestblock to return a list of suggestions, but that makes it more
expensive to insert if there isn't room in the suggested pages, since
heap_insert will have to try them all before giving up.
Comments regarding the general idea or the patch? There should probably
be an index option to turn the feature on and off. You'll want to turn it
off when you first load a table, and turn it on after CLUSTER to keep it
clustered.
Since there's been discussion on keeping the TODO list more up-to-date,
I hereby officially claim the "Automatically maintain clustering on a
table" TODO item :). Feel free to bombard me with requests for status
reports. And just to be clear, I'm not trying to sneak this into 8.2
anymore, this is 8.3 stuff.
I won't be implementing the background daemon described in the TODO item,
since that would essentially be an online version of CLUSTER. Which sure
would be nice, but that's a different story.
- Heikki
Attachments:
clustermaintenance.diff (text/x-patch, +343 -52)
Heikki Linnakangas <heikki@enterprisedb.com> writes:
While thinking about index-organized-tables and similar ideas, it
occurred to me that there's some low-hanging-fruit: maintaining cluster
order on inserts by trying to place new heap tuples close to other
similar tuples.
Doesn't this happen for free if there's enough space? UPDATE tries to
place the new tuple on the same page it's already on. In practice
people are only likely to cluster by primary keys (or other things that
seldom change) so I don't particularly agree with inventing a large wart
to support "optimally" placing things somewhere else...
regards, tom lane
On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
UPDATE tries to place the new tuple on the same page it's already
on.
I think he meant for INSERT.
--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris@enterprisedb.com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/
I have a table that gets lots of inserts (million+ rows per day) with an int8
primary key, and I cluster by a timestamp which is approximately the timestamp
of the insert and is therefore in increasing order and doesn't change. Most of
the rows are updated about 3 times over time, roughly within
the next 30 minutes. Should I assume that all of these updates will be
on separate pages unless I perform a cluster (which takes a long time) and
performance will suffer due to this? Is it possible to preallocate space on
the same page for future updates (whatever the average number of updates may
be per row) decreasing the number of page loads for querying?
Gene
--
Eugene Hart
Gene <genekhart@gmail.com> writes:
I have a table that inserts lots of rows (million+ per day) int8 as primary
key, and I cluster by a timestamp which is approximately the timestamp of
the insert beforehand and is therefore in increasing order and doesn't
change. Most of the rows are updated about 3 times over time roughly within
the next 30 minutes.
ISTM you should hardly need to worry about clustering that --- the data
will be in timestamp order pretty naturally.
The main problem you're going to have is the update-3-times bit. You
could keep updated rows on the same page as the original if you ran the
table at fillfactor 25% (which you'll be able to do in 8.2) ... but
while this might be sane for the leading edge of the table, you hardly
want such low storage density in the stable part.
You could reduce the fillfactor requirement if you could vacuum the
table constantly (every 10 minutes or so) but I assume the table is
large enough to make that unattractive. (Eventually we should have
a version of vacuum that understands where the dirty stuff is, which
might make this approach tenable ... but not in 8.2.)
Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care.
regards, tom lane
You are correct the main part I'm worried about is the updates, being so far
from the originals. FYI, I am partitioning the tables by the timestamp
column, vacuum analyzing once per hour, creating one child partition per day
in a cron job. Because I'm using hibernate for database abstraction
(stateless sessions), I can only have one RULE since having more than one
insert rule will not return the correct number of updated rows which
confuses hibernate. The one rule just directs inserts to the latest child
partition which seems to work well.
The reason I'm doing the clustering is I was hoping that with the "stable"
non-updating partitions I could execute a CLUSTER at night (slow...) and it
would compact the tables into their most efficient state for querying which
always involves a date range. bad idea? In this fillfactor feature, will you
be able to set it to 100% once you know that no more updates will occur? Or
will doing a cluster effectively do this? Will the fill factor only apply
for inserts?
"Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care."
This sounds interesting, I could create a RULE/INSERT on the unstable table,
I will know during the update if it is ready to be put in the stable table.
What would be an efficient way to do the transfer? Since the updates occur
somewhat randomly, wouldn't the tuples in the stable table then be out of
natural timestamp order?
thanks for all of your help and comments! it is greatly appreciated!
Gene Hart
Jonah H. Harris wrote:
On 8/9/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
UPDATE tries to place the new tuple on the same page it's already
on.
I think he meant for INSERT.
Right. Update is indeed taken care of already.
One example where this would help would be a customer_history table that
stores all transactions of a customer. Primary key is (customer_id,
transaction_id). You would want to keep the table clustered by
customer_id to make it quick to retrieve all transactions of a customer.
In general, any table with more or less random insert/delete activity
that you want to keep in order would benefit.
- Heikki
Gene wrote:
You are correct the main part I'm worried about is the updates, being
so far from the originals.
Yeah, you won't benefit from the patch at all.
The reason I'm doing the clustering is I was hoping that with the
"stable" non-updating partitions I could execute a CLUSTER at night
(slow...) and it would compact the tables into their most efficient
state for querying which always involves a date range. bad idea? In
this fillfactor feature, will you be able to set it to 100% once you
know that no more updates will occur? Or will doing a cluster
effectively do this? Will the fill factor only apply for inserts?
That sounds like a good approach. CLUSTER obeys the fillfactor, so
you'll want to set it to 100 for the older partitions before you CLUSTER.
You might want to experiment with the fillfactor. You might get the best
performance if you just set it to 100 even for the latest partition, if
your queries usually have to scan most of it anyway. Fillfactor 100 will
help to keep it dense and in memory, so it won't matter so much if it's
disorganized.
"Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care."
This sounds interesting, I could create a RULE/INSERT on the unstable
table, I will know during the update if it is ready to be put in the
stable table. What would be an efficient way to do the transfer? Since
the updates occur somewhat randomly, wouldn't the tuples in the stable
table then be out of natural timestamp order?
I'm not sure I understand the last sentence. I thought the updates
usually occur within 30 minutes of the insert. So if you transfer the
rows to the stable table after 30 minutes, there won't be updates to the
stable table.
- Heikki
Gene <genekhart@gmail.com> writes:
"Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care."
This sounds interesting, I could create a RULE/INSERT on the unstable table,
I will know during the update if it is ready to be put in the stable table.
What would be an efficient way to do the transfer? Since the updates occur
somewhat randomly, wouldn't the tuples in the stable table then be out of
natural timestamp order?
You may find it easier to handle some of the logic in a low-level application
layer or layer of stored procedures rather than trying to make it entirely
transparent with rules. If you do want it to be transparent you might also
consider whether you want triggers instead of rules.
Another direction you might want to consider is whether the columns that
you're updating would be more normalized in a separate table. You might really
want to have a record of those past states as well. So you might find having
three records in this other table for each of your regular records in your
main table might actually work out better.
Even if you only have a 1-1 relationship sometimes this kind of horizontal
partitioning (or do people consider this vertical partitioning?) is still
worthwhile. If the columns being updated are very small or often not needed at
all then it may be reasonably efficient to look them up separately and still
let you store the bulk of the data efficiently and access it in a fast
sequential scan.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
Gene <genekhart@gmail.com> writes:
I have a table that inserts lots of rows (million+ per day) int8 as primary
key, and I cluster by a timestamp which is approximately the timestamp of
the insert...
ISTM you should hardly need to worry about clustering that --- the data
will be in timestamp order pretty naturally.
In my case my biggest/slowest tables are clustered by zip-code (which
does a reasonable job at keeping counties/cities/etc on the
same pages too). Data comes in constantly (many records per minute, as
we ramp up), pretty uniformly across the country; but most queries
are geographically bounded. The data's pretty much insert-only.
If I understand Heikki's patch, it would help for this use case.
Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care.
Hmm... that should work well for me too. Not sure if the use-case
I mentioned above is still compelling anymore; since this seems like
it'd give me much of the benefit; and I don't need an excessive
fillfactor on the stable part of the table.
Ron Mayer wrote:
In my case my biggest/slowest tables are clustered by zip-code (which
does a reasonable job at keeping counties/cities/etc on the
same pages too). Data comes in constantly (many records per minute, as
we ramp up), pretty uniformly across the country; but most queries
are geographically bounded. The data's pretty much insert-only.
No deletes? If the tables grow over time, you probably would need to run
CLUSTER every now and then to get the best performance, though the patch
would alleviate that quite a lot.
Do you have a development environment where you could test what effect
the patch would have? It would be interesting to have a real-world use
case, since I don't have one myself at the moment.
If I understand Heikki's patch, it would help for this use case.
Yes, it would.
Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data, and transfer rows from
one to the other once they get stable. Storage density in the "fresh"
part would be poor, but it should be small enough you don't care.
Hmm... that should work well for me too. Not sure if the use-case
I mentioned above is still compelling anymore; since this seems like
it'd give me much of the benefit; and I don't need an excessive
fillfactor on the stable part of the table.
Umm, if your inserts are uniformly distributed across the country, you
wouldn't have a stable part, right?
- Heikki
Heikki Linnakangas wrote:
Ron Mayer wrote:
In my case my biggest/slowest tables are clustered by zip-code (which
does a reasonable job at keeping counties/cities/etc on the
same pages too)....
No deletes? If the tables grow over time, you probably would need to run
CLUSTER every now and then to get the best performance, though the patch
would alleviate that quite a lot.
Yup, pretty much no deletes; since it's a historical archive of
some government documents with address info. Though the
live system may periodically expunge data, say, 10+ years old.
Do you have a development environment where you could test what effect
the patch would have? It would be interesting to have a real-world use
case, since I don't have one myself at the moment.
I have a development environment, but it doesn't have the same
real-time-growing behavior, and only a small region of the country.
I suppose I could pre-load N-1 years and cluster it, and then
incrementally insert the last year of data to simulate the effect.
But sure, I'll attempt to try the patch; but don't really have any
good benchmarking environment to give any definitive results. If
an anecdotal "this is how it feels to me" is useful, I can give one
of those.
Your best bet might be to partition the table into two subtables, one
with "stable" data and one with the fresh data.
Hmm... that should work well for me too....
Umm, if your inserts are uniformly distributed across the country, you
wouldn't have a stable part, right?
Hmm. Maybe. I was thinking when archiving to the large table
an "order by" clause when inserting from the new partition to the
stable partition could at least make the big table "piecewise"
clustered so most records for a zip code fit in the same few disk
pages, even though those pages would still end up lying around
far apart on the disk.
I wonder what part of "CLUSTER" gives the most benefit - that
most records of a type fit on a few blocks; or that those blocks
are next to each other so can be read sequentially?
[ Sorry, I only found this one recently. ]
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Ah, thanks! I had forgotten about it as well.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
What about adding the ability to ask the FSM for a page that's near a
given page? That way if you did have to go to the FSM you could at least
try and insert close to the page you originally wanted.
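The FSM has no such call today; as a sketch of what "a page near a given page" could mean (the helper name and the flat per-page free-space array are made up for illustration), the lookup might be:

```c
#include <stdlib.h>

/* Hypothetical helper (not an existing FSM function): scan a per-page
 * free-space table and return the page with enough room that is
 * closest to the requested target page. Returns -1 if none fits. */
static int
nearest_page_with_space(const int *freespace, int npages,
                        int target, int needed)
{
    int best = -1;

    for (int page = 0; page < npages; page++)
    {
        if (freespace[page] < needed)
            continue;               /* not enough room on this page */
        if (best == -1 || abs(page - target) < abs(best - target))
            best = page;            /* closer to target than best so far */
    }
    return best;
}
```

A real implementation would of course search the FSM's own data structure rather than a linear array, but the contract would be the same.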
--
Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote:
What about adding the ability to ask the FSM for a page that's near a
given page? That way if you did have to go to the FSM you could at least
try and insert close to the page you originally wanted.
Yeah, there's always room for improvement. I made the patch when I was
working on clustered indexes, and was mostly concerned about getting
inserts to the same page as other tuples with similar values so that the
clustered index stays clustered.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 5/16/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:
Jim C. Nasby wrote:
What about adding the ability to ask the FSM for a page that's near a
given page? That way if you did have to go to the FSM you could at least
try and insert close to the page you originally wanted.
Yeah, there's always room for improvement. I made the patch when I was
working on clustered indexes, and was mostly concerned about getting
inserts to the same page as other tuples with similar values so that the
clustered index stays clustered.
the patch doesn't apply in cvs... you'll need to update it...
--
regards,
Jaime Casanova
"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
Richard Cook
On 5/17/07, Jaime Casanova <systemguards@gmail.com> wrote:
the patch doesn't apply in cvs... you'll need to update it...
CVS I wrote? HEAD I was meaning... sorry, too late for me =)
--
regards,
Jaime Casanova
"Programming today is a race between software engineers striving to
build bigger and better idiot-proof programs and the universe trying
to produce bigger and better idiots.
So far, the universe is winning."
Richard Cook
Jaime Casanova wrote:
the patch doesn't apply in cvs... you'll need to update it...
Oh, here you are.
The implementation has changed a bit since August. I thought I had
submitted an updated version in the winter but couldn't find it. Anyway,
I updated and dusted off the source tree, tidied up the comments a
little bit, and fixed some inconsistencies in pg_proc entries that made
opr_sanity fail.
The beef of the patch is two new optional indexam API functions:
amprepareinsert and amfinishinsert. amprepareinsert is called before
inserting the heap tuple. It descends the tree and finds and pins the
right leaf page to insert to, and returns a suggestion on where the heap
tuple should be inserted. amfinishinsert is called after inserting the
heap tuple to actually insert the index tuple. Documentation for these
functions needs to be added to indexam.sgml; I noticed that's not done yet.
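A toy model of that call ordering (names paraphrase the proposed API; the arithmetic stand-ins for tree descent and heap placement are made up) might look like:

```c
/* Toy model of the proposed two-phase index insert: amprepareinsert
 * locates (and would pin) the target leaf page and suggests a heap
 * block; the heap insert happens in between; amfinishinsert then
 * inserts the index tuple pointing at wherever the heap tuple
 * actually went. Not backend code. */
typedef struct
{
    int leaf_page;       /* leaf the key belongs on (kept "pinned") */
    int suggested_block; /* heap block of the key's neighbours */
} PreparedInsert;

static PreparedInsert
am_prepare_insert(int key)
{
    PreparedInsert p;

    p.leaf_page = key / 100;        /* pretend leaves cover 100 keys each */
    p.suggested_block = key / 10;   /* pretend 10 tuples fit per heap block */
    return p;
}

static int
heap_insert_toy(int suggested_block, int has_room)
{
    /* Use the suggestion when the page has room, else extend the heap. */
    return has_room ? suggested_block : 9999;
}

static int
am_finish_insert(const PreparedInsert *p, int actual_block)
{
    /* The index tuple records where the heap tuple really landed;
     * the still-pinned leaf page (p->leaf_page) saves a second descent. */
    (void) p->leaf_page;
    return actual_block;
}
```

The point of the split is exactly the single descent: the leaf found in the prepare step is reused in the finish step instead of being searched for again.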
The cluster_inserts GUC option that you can use to enable/disable the
feature should be removed before committing.
The performance characteristics of this patch haven't been thoroughly
discussed yet. The reason why you want to cluster your tables is to
speed up SELECTs that return a bunch of tuples with similar values, for
example range queries. The reason for keeping them clustered on inserts
is to reduce the need to run CLUSTER as often.
It doesn't come without a cost, however. In the worst case, there never
is room for new inserts on pages, and each insert needs to do one extra
I/O to fetch the optimal heap page where the insert should go, see that
there's no room, and then insert somewhere else. Using a non-zero
fillfactor helps, but even when there is room on the page, it's often
cheaper to just append to the end of the table and run CLUSTER at
night, for example, than to do random access to insert to the "right" pages
in the heap.
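The trade-off boils down to a check like this (a sketch with an invented helper, not the patch's actual code):

```c
/* Sketch of the insert-time decision: use the index-suggested heap
 * page only if it has room for the tuple; otherwise fall back to the
 * end of the table. The fallback path is what makes the extra fetch
 * of the suggested page a wasted I/O in the worst case. */
static int
choose_insert_page(int suggested_page, int free_on_suggested,
                   int tuple_size, int last_page)
{
    if (suggested_page >= 0 && free_on_suggested >= tuple_size)
        return suggested_page;      /* order preserved, one extra fetch */
    return last_page;               /* give up: append at the end */
}
```

A fillfactor below 100 raises `free_on_suggested` on average and so makes the first branch more likely to win.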
So, should we have a WITH-option on the table to enable/disable this
feature, and what would be the default?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
maintain_cluster_order_v7.patch (text/x-diff, +465 -114)
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +