CLUSTER and indisclustered
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. It could work as follows:
1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page) b) if we have found tuple(s) matching the
ScanKey when we encounter a non-matching tuple (via
HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
scan
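The termination logic of steps 6a and 6b can be sketched in miniature, assuming a perfectly clustered heap. This is a toy Python model, not backend C; `cluster_scan` and the in-memory heap are invented purely for illustration:

```python
import bisect

def cluster_scan(heap, key):
    """heap: list of (key, payload) pairs in clustered (sorted) order."""
    keys = [k for k, _ in heap]   # a real scan would probe pages directly
    # 6a: locate the first candidate tuple intelligently, instead of
    # scanning from the first page.
    lo = bisect.bisect_left(keys, key)
    results = []
    for k, payload in heap[lo:]:
        if k != key:
            break                 # 6b: first non-match terminates the scan
        results.append(payload)
    return results

heap = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (5, 'e')]
print(cluster_scan(heap, 2))      # ['b', 'c']
```

The win comes from never touching pages before the first match or after the last one, which only holds while the heap really is in key order.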
Any reason this isn't possible? Any reason it couldn't dramatically speed
up the performance of the type of query i've mentioned?
Gavin
Gavin Sherry <swm@linuxworld.com.au> writes:
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans.
AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
that's only usable on a perfectly-clustered table.
The existing approach to making the planner smart about clustered tables
is to compute a physical-vs-logical-order-correlation statistic and use
that to adjust the estimated cost of indexscans. I believe this is a
more robust approach than considering a table to be "clustered" or "not
clustered", since it can deal with the gradual degradation of clustered
order over time. However, I will not make any great claims for the
specific equations currently used for this purpose --- they're surely in
need of improvement. Feel free to take a look and see if you have any
ideas. The collection of the statistic is in commands/analyze.c and the
use of it is in optimizer/path/costsize.c.
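For illustration, the kind of physical-vs-logical-order correlation ANALYZE gathers can be sketched outside the backend. This is a toy Python model; the actual equations in commands/analyze.c differ in detail:

```python
def order_correlation(column_values):
    """Correlation between heap (physical) order and key (logical) order."""
    n = len(column_values)
    physical = list(range(n))
    order = sorted(range(n), key=lambda i: column_values[i])
    rank = [0] * n
    for r, i in enumerate(order):
        rank[i] = r               # logical rank of the tuple at heap slot i
    mx = sum(physical) / n
    my = sum(rank) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(physical, rank))
    var = sum((x - mx) ** 2 for x in physical) * sum((y - my) ** 2 for y in rank)
    return cov / var ** 0.5

# ~1.0: heap is in index order, so an index scan touches pages nearly
# sequentially; ~-1.0: exactly reversed; near 0: random heap access.
print(order_correlation([10, 20, 30, 40]))   # 1.0 (freshly CLUSTERed)
print(order_correlation([40, 30, 20, 10]))   # -1.0
```

A statistic like this degrades gracefully as inserts and updates scramble the ordering, which is exactly why it is more robust than a boolean "clustered" flag.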
regards, tom lane
On Sat, 3 Aug 2002, Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans.

AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
Sorry, I meant to say that heap_insert() etc would need to set
indisclustered to false.
I do see some worth in this however. Naturally, in a situation where a
database is being modified very often this is of little value. However,
for applications focussed on analysing large amounts of static data this
could increase performance significantly. Once I get some time I will
attempt to explore this further in `diff -c` format :-).
Gavin
Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans.

AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
that's only usable on a perfectly-clustered table.

The existing approach to making the planner smart about clustered tables
is to compute a physical-vs-logical-order-correlation statistic and use
that to adjust the estimated cost of indexscans. I believe this is a
more robust approach than considering a table to be "clustered" or "not
clustered", since it can deal with the gradual degradation of clustered
order over time. However, I will not make any great claims for the
specific equations currently used for this purpose --- they're surely in
need of improvement. Feel free to take a look and see if you have any
ideas. The collection of the statistic is in commands/analyze.c and the
use of it is in optimizer/path/costsize.c.
Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Gavin Sherry wrote:
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. It could work as follows:

1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page) b) if we have found tuple(s) matching the
ScanKey when we encounter a non-matching tuple (via
HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
scan
Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page, and the heap matches are on the same page.
On Sat, 3 Aug 2002, Bruce Momjian wrote:
Gavin Sherry wrote:
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. It could work as follows:

1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page) b) if we have found tuple(s) matching the
ScanKey when we encounter a non-matching tuple (via
HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
scan

Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page, and the heap matches are on the same page.
Bruce,
It would cut out the index overhead. Besides, at (1) above we would have
determined that an index scan was too expensive and we would be using a
SeqScan instead. This would just be faster, since a) we would locate the
tuples more intelligently and b) we wouldn't need to scan the whole heap
once we'd found all tuples matching the scan key.
Gavin
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?
Could do that I suppose, but I'm not super-excited about it. ANALYZE is
quite cheap these days (especially in comparison to CLUSTER ;-)). I'd
settle for a note in the CLUSTER docs that recommends a subsequent
ANALYZE --- this seems no different from recommending ANALYZE after bulk
data load or other major update of a table.
regards, tom lane
Gavin Sherry wrote:
Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page, and the heap matches are on the same page.

Bruce,

It would cut out the index overhead. Besides, at (1) above we would have
determined that an index scan was too expensive and we would be using a
SeqScan instead. This would just be faster, since a) we would locate the
tuples more intelligently b) we wouldn't need to scan the whole heap once
we'd found all tuples matching the scan key.
Yes, but in a clustered table, an index scan is _never_ (?) more
expensive than a sequential scan, at least if the optimizer is working
correctly. Index scans are slower only because they assume random heap
access, but with a clustered table, there is no random heap access. The
index takes you right to the spot to start.
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?

Could do that I suppose, but I'm not super-excited about it. ANALYZE is
quite cheap these days (especially in comparison to CLUSTER ;-)). I'd
settle for a note in the CLUSTER docs that recommends a subsequent
ANALYZE --- this seems no different from recommending ANALYZE after bulk
data load or other major update of a table.
OK. I am sure it is not obvious to people that they should ANALYZE,
because the data in their table hasn't changed, just the ordering.
Gavin Sherry <swm@linuxworld.com.au> writes:
On Sat, 3 Aug 2002, Tom Lane wrote:
AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
Sorry, I meant to say that heap_insert() etc would need to set
indisclustered to false.
<<itch>> You could do that, but only if you are prepared to invent
a mechanism that will instantly invalidate any existing query plans
that assume the clustered ordering is good.
Up to now we've only allowed the planner to make decisions that impact
performance, not correctness of the result. I'm uncomfortable with the
idea that a "clusterscan" plan could silently return wrong answers after
someone else updates the table and doesn't tell us they did.
regards, tom lane
Gavin Sherry wrote:
On Sat, 3 Aug 2002, Bruce Momjian wrote:
Gavin Sherry wrote:
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. It could work as follows:

1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page) b) if we have found tuple(s) matching the
ScanKey when we encounter a non-matching tuple (via
HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
scan

Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page, and the heap matches are on the same page.

Bruce,

It would cut out the index overhead. Besides, at (1) above we would have
determined that an index scan was too expensive and we would be using a
SeqScan instead. This would just be faster, since a) we would locate the
tuples more intelligently and b) we wouldn't need to scan the whole heap
once we'd found all tuples matching the scan key.

Gavin
Gavin and Bruce,
I am not so sure index access in these cases is such an overhead, since
the clustered nature of the table means that many index entries will
point to table data that is located in a few pages.
Ok, this change would save you the initial access of the index structure
itself, but isn't the usual killer for indexes the "thrashing" that
happens when the "pointed to" table data is spread over many pages?
I did some tests on this a while ago (7.1 era) and discovered that for
a clustered table a sequential scan did not start to win until the
target dataset was ~30% of the table itself.
While the suggested change would of course mean that seq scans would do
much better than they did in my tests, it would be interesting to see
some timings...
regards
Mark
On Sun, 4 Aug 2002, mark Kirkwood wrote:
Ok, this change would save you the initial access of the index
structure itself, but isn't the usual killer for indexes the
"thrashing" that happens when the "pointed to" table data is spread
over many pages?
Yeah, no kidding on this one. I've reduced queries from 75 seconds
to 0.6 seconds by clustering on the appropriate field.
But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to? That would introduce a certain amount
of "sequentialness" to the reads that the OS is not capable of
introducing (since it can't reschedule the reads you're doing, the
way it could reschedule, say, random writes).
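The reordering idea can be sketched as follows. This is a toy Python model rather than backend code; the block cache and reader callback are invented for illustration:

```python
def fetch_in_block_order(tids, read_block):
    """tids: (block, offset) pairs in index order; read each block once."""
    cache = {}
    results = []
    for block, offset in sorted(tids):       # physical order, not index order
        if block not in cache:
            cache[block] = read_block(block) # one physical read per block
        results.append(cache[block][offset])
    return results

heap = {0: ['a', 'b'], 1: ['c', 'd'], 2: ['e', 'f']}
reads = []
def read_block(b):
    reads.append(b)                          # record the physical I/O pattern
    return heap[b]

print(fetch_in_block_order([(2, 0), (0, 1), (2, 1), (0, 0)], read_block))
# ['a', 'b', 'e', 'f'] -- tuples come back in physical, ascending order
print(reads)                                 # [0, 2] -- each block read once
```

Sorting the fetches turns the scattered index-order reads into a mostly ascending sweep over the heap, which is the "sequentialness" the OS cannot impose on its own.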
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
On Wed, 7 Aug 2002, Curt Sampson wrote:
But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to? That would introduce a certain amount
of "sequentialness" to the reads that the OS is not capable of
introducing (since it can't reschedule the reads you're doing, the
way it could reschedule, say, random writes).
This sounds more or less like the method employed by Firebird as described
by Ann Douglas to Tom at OSCON (correct me if I get this wrong).
Basically, Firebird populates a bitmap with the entries the scan is interested
in. The bitmap is populated in page order so that all entries on the same
heap page can be fetched at once.
This is totally different to the way postgres does things and would
require significant modification to the index access methods.
Gavin
Curt Sampson <cjs@cynic.net> writes:
But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to?
I thought to myself "didn't I just post something about that?"
and then realized it was on a different mailing list. Here ya go
(and no, this is not the first time around on this list either...)
I am currently thinking that bitmap indexes per se are not all that
interesting. What does interest me is bitmapped index lookup, which
came back into mind after hearing Ann Harrison describe how FireBird/
InterBase does it.
The idea is that you don't scan the index and base table concurrently
as we presently do it. Instead, you scan the index and make a list
of the TIDs of the table tuples you need to visit. This list can
be conveniently represented as a sparse bitmap. After you've finished
looking at the index, you visit all the required table tuples *in
physical order* using the bitmap. This eliminates multiple fetches
of the same heap page, and can possibly let you get some win from
sequential access.
Once you have built this mechanism, you can then move on to using
multiple indexes in interesting ways: you can do several indexscans
in one query and then AND or OR their bitmaps before doing the heap
scan. This would allow, for example, "WHERE a = foo and b = bar"
to be handled by ANDing results from separate indexes on the a and b
columns, rather than having to choose only one index to use as we do
now.
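A toy model of the bitmap lookup and ANDing described above, with Python sets standing in for the sparse TID bitmap (all names here are illustrative, not backend code):

```python
def index_probe(index, value):
    """index: dict mapping key -> list of (block, offset) TIDs.
    Returns a set of TIDs, a stand-in for the sparse bitmap."""
    return set(index.get(value, []))

def bitmap_and_scan(heap, index_a, a_val, index_b, b_val):
    # Scan both indexes first, AND the bitmaps, *then* touch the heap
    # in physical order: each page is visited at most once.
    bitmap = index_probe(index_a, a_val) & index_probe(index_b, b_val)
    return [heap[tid] for tid in sorted(bitmap)]

heap = {(0, 0): ('foo', 'bar'), (0, 1): ('foo', 'baz'),
        (1, 0): ('qux', 'bar'), (1, 1): ('foo', 'bar')}
index_a = {'foo': [(0, 0), (0, 1), (1, 1)], 'qux': [(1, 0)]}
index_b = {'bar': [(0, 0), (1, 0), (1, 1)], 'baz': [(0, 1)]}

# "WHERE a = 'foo' AND b = 'bar'" using both indexes at once:
print(bitmap_and_scan(heap, index_a, 'foo', index_b, 'bar'))
# [('foo', 'bar'), ('foo', 'bar')] -- TIDs (0,0) and (1,1), in heap order
```

Replacing `&` with `|` gives the OR case; the key property is that index scanning and heap visiting become separate phases.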
Some thoughts about implementation: FireBird's implementation seems
to depend on an assumption about a fixed number of tuple pointers
per page. We don't have that, but we could probably get away with
just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits. (It's
interesting to think of this as lossy compression of the bitmap...
which leads to the idea of only being fuzzy in limited areas of the
bitmap, rather than losing all the information you have.)
A possibly nasty issue is that lazy VACUUM has some assumptions in it
about indexscans holding pins on index pages --- that's what prevents
it from removing heap tuples that a concurrent indexscan is just about
to visit. It might be that there is no problem: even if lazy VACUUM
removes a heap tuple and someone else then installs a new tuple in that
same TID slot, you should be okay because the new tuple is too new to
pass your visibility test. But I'm not convinced this is safe.
regards, tom lane
On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
On Wed, 7 Aug 2002, Tom Lane wrote:
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.

Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down into chunks.

But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.
When the tuples are small relative to pagesize, you may get some
"compression" by saving just pages and not the actual tids in the the
bitmap.
-------------
Hannu
On Wed, 7 Aug 2002, Tom Lane wrote:
I thought to myself "didn't I just post something about that?"
and then realized it was on a different mailing list. Here ya go
(and no, this is not the first time around on this list either...)
Wow. I'm glad to see you looking at this, because this feature would do
*so* much for the performance of some of my queries, and really, really
impress my "billion-row-database" client.
The idea is that you don't scan the index and base table concurrently
as we presently do it. Instead, you scan the index and make a list
of the TIDs of the table tuples you need to visit.
Right.
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.
Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down into chunks. If you want to limit to,
say, 1000 TIDs, and your index has 3000, just do the first 1000, then
the next 1000, then the last 1000. This would still result in much less
disk head movement and speed the query immensely.
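The chunking fallback might look like this in miniature (illustrative Python; the 1000-TID figure above is just an example limit):

```python
def chunked_fetch(tids, chunk_size=1000):
    """Yield TIDs sorted into physical order within fixed-size chunks.
    Global ordering is lost, but head movement shrinks inside each chunk."""
    for start in range(0, len(tids), chunk_size):
        chunk = sorted(tids[start:start + chunk_size])
        for tid in chunk:
            yield tid   # fetch the heap tuple at tid here

tids = [(5, 0), (1, 0), (3, 0), (2, 0)]
print(list(chunked_fetch(tids, chunk_size=2)))
# [(1, 0), (5, 0), (2, 0), (3, 0)] -- sorted within each chunk of 2
```

The trade-off Tom raises still applies: because no chunk sees the whole index result, the chunks cannot be ANDed or ORed against another index's bitmap.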
(BTW, I have verified this empirically during testing of random read vs.
random write on a RAID controller. The writes were 5-10 times faster
than the reads because the controller was caching a number of writes and
then doing them in the best possible order, whereas the reads had to be
satisfied in the order they were submitted to the controller.)
cjs
Curt Sampson <cjs@cynic.net> writes:
On Wed, 7 Aug 2002, Tom Lane wrote:
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.
Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down in to chunks.
But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.
Another point to keep in mind is that the bigger the bitmap gets, the
less useful an indexscan is, by definition --- sooner or later you might
as well fall back to a seqscan. So the idea of lossy compression of a
large bitmap seems really ideal to me. In principle you could seqscan
the parts of the table where matching tuples are thick on the ground,
and indexscan the parts where they ain't. Maybe this seems natural
to me as an old JPEG campaigner, but if you don't see the logic I
recommend thinking about it a little ...
regards, tom lane
On Wed, 7 Aug 2002, Tom Lane wrote:
But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.
Right. It'd be a shame to lose that, but a little is better than nothing
at all, if one ends up being faced with that decision.
Another point to keep in mind is that the bigger the bitmap gets, the
less useful an indexscan is, by definition --- sooner or later you might
as well fall back to a seqscan.
Well, yes, so long as you choose the correct values of "big." I'd want
this to be able to optimize queries against a two billion row table
about 150 GB in size. And that might even get bigger in a few years.
Maybe this seems natural
to me as an old JPEG campaigner, but if you don't see the logic I
recommend thinking about it a little ...
Well, photos are certainly not random, but database tables may be
in essentially random order far more often. How much that applies,
I'm not sure, since I don't really know a lot about this stuff.
I'll take your word for it on what's best there.
cjs
Hannu Krosing <hannu@tm.ee> writes:
Now I remembered my original preference for page bitmaps (vs. tuple
bitmaps): one can't actually make good use of a bitmap of tuples because
there is no fixed tuples/page ratio and thus no way to quickly go from
bit position to actual tuple. You mention the same problem but propose a
different solution.
Using page bitmap, we will at least avoid fetching any unneeded pages -
essentially we will have a sequential scan over possibly interesting
pages.
Right. One form of the "lossy compression" idea I suggested is to
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
too large to work with. Again, one could imagine doing that only in
denser areas of the bitmap.
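A toy model of that per-tuple-to-per-page degradation (illustrative Python only; note it crudely degrades the whole bitmap at once, whereas the suggestion above is to do it only in denser areas):

```python
class TidBitmap:
    """TID bitmap that turns "lossy" when it grows too large. Exact mode
    remembers (block, offset) pairs; lossy mode keeps only block numbers,
    so the scan must recheck every tuple on a lossy page."""

    def __init__(self, max_exact=4):
        self.exact = set()       # (block, offset) TIDs
        self.lossy = set()       # block numbers only
        self.max_exact = max_exact

    def add(self, block, offset):
        if block in self.lossy:
            return               # page already marked; no detail to keep
        self.exact.add((block, offset))
        if len(self.exact) > self.max_exact:
            # Degrade: collapse exact entries into page-level bits.
            self.lossy |= {b for b, _ in self.exact}
            self.exact.clear()

    def pages(self):
        """(block, needs_recheck) pairs, in physical order."""
        exact_pages = {b for b, _ in self.exact}
        return sorted((b, b in self.lossy) for b in exact_pages | self.lossy)

bm = TidBitmap(max_exact=3)
for tid in [(0, 1), (0, 2), (7, 0)]:
    bm.add(*tid)
print(bm.pages())   # [(0, False), (7, False)] -- still exact
bm.add(4, 5)        # fourth TID exceeds the limit -> go lossy
print(bm.pages())   # [(0, True), (4, True), (7, True)]
```

The memory bound becomes the number of pages rather than the number of tuples, at the cost of rechecking quals on lossy pages, which is the "sequential scan over possibly interesting pages" Hannu describes.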
But I guess that CLUSTER support for INSERT will not be touched for 7.3,
nor will real bitmap indexes ;)
All of this is far-future work I think. Adding a new scan type to the
executor would probably be pretty localized, but the ramifications in
the planner could be extensive --- especially if you want to do plans
involving ANDed or ORed bitmaps.
regards, tom lane
On Wed, 2002-08-07 at 06:46, Hannu Krosing wrote:
On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
On Wed, 7 Aug 2002, Tom Lane wrote:
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.

Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down into chunks.

But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.

When the tuples are small relative to pagesize, you may get some
"compression" by saving just pages and not the actual tids in the
bitmap.
Now I remembered my original preference for page bitmaps (vs. tuple
bitmaps): one can't actually make good use of a bitmap of tuples because
there is no fixed tuples/page ratio and thus no way to quickly go from
bit position to actual tuple. You mention the same problem but propose a
different solution.
Using page bitmap, we will at least avoid fetching any unneeded pages -
essentially we will have a sequential scan over possibly interesting
pages.
If we were to use page-bitmap index for something with only a few values
like booleans, some insert-time local clustering should be useful, so
that TRUEs and FALSEs end up on different pages.
But I guess that CLUSTER support for INSERT will not be touched for 7.3,
nor will real bitmap indexes ;)
---------------
Hannu