CLUSTER and indisclustered
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. Here is one possible approach.
1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page); b) if we have found tuple(s) matching the
ScanKey and we encounter a non-matching tuple (via
HeapTupleSatisfies()?), we return a NULL'ed-out tuple, terminating the
scan
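As a toy model of steps 6a/6b, assuming a perfectly clustered heap of integer keys (names and the binary-search shortcut are illustrative only, not actual executor code):

```c
#include <stddef.h>

/* Toy "cluster scan" over a heap that is perfectly ordered by the scan
 * key: binary-search for the first match (step 6a) and stop at the first
 * non-matching tuple (step 6b) instead of scanning every page. */
static size_t cluster_scan_count(const int *heap, size_t ntuples, int key)
{
    size_t lo = 0, hi = ntuples;

    /* 6a: find the first tuple >= key without scanning from page 0 */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (heap[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* 6b: count matches, terminating at the first non-match */
    size_t count = 0;
    while (lo + count < ntuples && heap[lo + count] == key)
        count++;
    return count;
}
```

On a clustered heap this touches only the pages holding matches, instead of every page as a plain seqscan would.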
Any reason this isn't possible? Any reason it couldn't dramatically speed
up the performance of the type of query I've mentioned?
Gavin
Gavin Sherry <swm@linuxworld.com.au> writes:
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans.
AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
that's only usable on a perfectly-clustered table.
The existing approach to making the planner smart about clustered tables
is to compute a physical-vs-logical-order-correlation statistic and use
that to adjust the estimated cost of indexscans. I believe this is a
more robust approach than considering a table to be "clustered" or "not
clustered", since it can deal with the gradual degradation of clustered
order over time. However, I will not make any great claims for the
specific equations currently used for this purpose --- they're surely in
need of improvement. Feel free to take a look and see if you have any
ideas. The collection of the statistic is in commands/analyze.c and the
use of it is in optimizer/path/costsize.c.
regards, tom lane
On Sat, 3 Aug 2002, Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans.

AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
Sorry, I meant to say that heap_insert() etc would need to set
indisclustered to false.
I do see some worth in this however. Naturally, in a situation where a
database is being modified very often this is of little value. However,
for applications focussed on analysing large amounts of static data this
could increase performance significantly. Once I get some time I will
attempt to explore this further in `diff -c` format :-).
Gavin
Tom Lane wrote:
Gavin Sherry <swm@linuxworld.com.au> writes:
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans.

AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
that's only usable on a perfectly-clustered table.

The existing approach to making the planner smart about clustered tables
is to compute a physical-vs-logical-order-correlation statistic and use
that to adjust the estimated cost of indexscans. I believe this is a
more robust approach than considering a table to be "clustered" or "not
clustered", since it can deal with the gradual degradation of clustered
order over time. However, I will not make any great claims for the
specific equations currently used for this purpose --- they're surely in
need of improvement. Feel free to take a look and see if you have any
ideas. The collection of the statistic is in commands/analyze.c and the
use of it is in optimizer/path/costsize.c.
Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Gavin Sherry wrote:
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. Here is one possible approach.

1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page) b) if we have found tuple(s) matching the
ScanKey and we encounter a non-matching tuple (via
HeapTupleSatisfies() ?) we return a NULL'ed out tuple, terminating the
scan
Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page and the heap matches are on the same page?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Sat, 3 Aug 2002, Bruce Momjian wrote:
Gavin Sherry wrote:
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. Here is one possible approach.

1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page); b) if we have found tuple(s) matching the
ScanKey and we encounter a non-matching tuple (via
HeapTupleSatisfies()?), we return a NULL'ed-out tuple, terminating the
scan

Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page and the heap matches are on the same page?
Bruce,
It would cut out the index overhead. Besides, at (1) (above) we would have
determined that an index scan was too expensive and we would be using a
SeqScan instead. This would just be faster, since a) we would locate the
tuples more intelligently; b) we wouldn't need to scan the whole heap once
we'd found all tuples matching the scan key.
Gavin
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?
Could do that I suppose, but I'm not super-excited about it. ANALYZE is
quite cheap these days (especially in comparison to CLUSTER ;-)). I'd
settle for a note in the CLUSTER docs that recommends a subsequent
ANALYZE --- this seems no different from recommending ANALYZE after bulk
data load or other major update of a table.
regards, tom lane
Gavin Sherry wrote:
Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page and the heap matches are on the same page?

Bruce,

It would cut out the index overhead. Besides, at (1) (above) we would have
determined that an index scan was too expensive and we would be using a
SeqScan instead. This would just be faster, since a) we would locate the
tuples more intelligently; b) we wouldn't need to scan the whole heap once
we'd found all tuples matching the scan key.
Yes, but in a clustered table, an index scan is _never_ (?) more
expensive than a sequential scan, at least if the optimizer is working
correctly. Index scans are slower only because they assume random heap
access, but with a clustered table, there is no random heap access. The
index takes you right to the spot to start.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?

Could do that I suppose, but I'm not super-excited about it. ANALYZE is
quite cheap these days (especially in comparison to CLUSTER ;-)). I'd
settle for a note in the CLUSTER docs that recommends a subsequent
ANALYZE --- this seems no different from recommending ANALYZE after bulk
data load or other major update of a table.
OK. I am sure it is not obvious to people that they should ANALYZE, because
the data in their table hasn't changed, just the ordering.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Gavin Sherry <swm@linuxworld.com.au> writes:
On Sat, 3 Aug 2002, Tom Lane wrote:
AFAICT you're assuming that the table is *exactly* ordered by the
clustered attribute. While this is true at the instant CLUSTER
completes, the exact ordering will be destroyed by the first insert or
update :-(. I can't see much value in creating a whole new scan type
Sorry, I meant to say that heap_insert() etc would need to set
indisclustered to false.
<<itch>> You could do that, but only if you are prepared to invent
a mechanism that will instantly invalidate any existing query plans
that assume the clustered ordering is good.
Up to now we've only allowed the planner to make decisions that impact
performance, not correctness of the result. I'm uncomfortable with the
idea that a "clusterscan" plan could silently return wrong answers after
someone else updates the table and doesn't tell us they did.
regards, tom lane
Gavin Sherry wrote:
On Sat, 3 Aug 2002, Bruce Momjian wrote:
Gavin Sherry wrote:
Hi all,
It occurred to me on the plane home that now that CLUSTER is fixed we may
be able to put pg_index.indisclustered to use. If CLUSTER were to set
indisclustered to true when it clusters a heap according to the given
index, we could speed up sequential scans. Here is one possible approach.

1) Planner determines that a seqscan is appropriate *and* the retrieval is
qualified by the key(s) of one of the relation's indexes
2) Planner determines that the relation is clustered on disk according to
the index over the key(s) used to qualify the retrieval
3) Planner sets an appropriate nodeTag for the retrieval (SeqScanCluster?)
4) ExecProcNode() calls some new scan routine, ExecSeqScanCluster() ?
5) ExecSeqScanCluster() calls ExecScan() with a new ExecScanAccessMtd (ie,
different from SeqNext) called SeqClusterNext
6) SeqClusterNext() has all the heapgettup() logic with two
exceptions: a) we find the first tuple more intelligently (instead of
scanning from the first page); b) if we have found tuple(s) matching the
ScanKey and we encounter a non-matching tuple (via
HeapTupleSatisfies()?), we return a NULL'ed-out tuple, terminating the
scan

Gavin, is that a big win compared to just using the index and looping
through the entries, knowing that the index matches are on the same
page and the heap matches are on the same page?

Bruce,

It would cut out the index overhead. Besides, at (1) (above) we would have
determined that an index scan was too expensive and we would be using a
SeqScan instead. This would just be faster, since a) we would locate the
tuples more intelligently; b) we wouldn't need to scan the whole heap once
we'd found all tuples matching the scan key.

Gavin
Gavin and Bruce,
I am not so sure index access in these cases is such an overhead, since
the clustered nature of the table means that many index elements will
point to table data that is located in a few pages.

Ok, this change would save you the initial access of the index structure
itself - but isn't the usual killer for indexes the "thrashing" that
happens when the "pointed to" table data is spread over many pages?
I did some tests on this a while ago (7.1 era) and discovered that for
a clustered table a sequential scan did not start to win until the
target dataset was ~30% of the table itself.
While the suggested change would of course mean that seq scans would do
much better than they did in my tests, it would be interesting to see
some timings...
regards
Mark
On Sun, 4 Aug 2002, mark Kirkwood wrote:
Ok, this change would save you the initial access of the index
structure itself - but isn't the usual killer for indexes the
"thrashing" that happens when the "pointed to" table data is spread
over many pages?
Yeah, no kidding on this one. I've reduced queries from 75 seconds
to 0.6 seconds by clustering on the appropriate field.
But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to? That would introduce a certain amount
of "sequentialness" to the reads that the OS is not capable of
introducing (since it can't reschedule the reads you're doing, the
way it could reschedule, say, random writes).
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
On Wed, 7 Aug 2002, Curt Sampson wrote:
But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to? That would introduce a certain amount
of "sequentialness" to the reads that the OS is not capable of
introducing (since it can't reschedule the reads you're doing, the
way it could reschedule, say, random writes).
This sounds more or less like the method employed by Firebird as described
by Ann Douglas to Tom at OSCON (correct me if I get this wrong).
Basically, firebird populates a bitmap with entries the scan is interested
in. The bitmap is populated in page order so that all entries on the same
heap page can be fetched at once.
This is totally different to the way postgres does things and would
require significant modification to the index access methods.
Gavin
Curt Sampson <cjs@cynic.net> writes:
But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to?
I thought to myself "didn't I just post something about that?"
and then realized it was on a different mailing list. Here ya go
(and no, this is not the first time around on this list either...)
I am currently thinking that bitmap indexes per se are not all that
interesting. What does interest me is bitmapped index lookup, which
came back into mind after hearing Ann Harrison describe how FireBird/
InterBase does it.
The idea is that you don't scan the index and base table concurrently
as we presently do it. Instead, you scan the index and make a list
of the TIDs of the table tuples you need to visit. This list can
be conveniently represented as a sparse bitmap. After you've finished
looking at the index, you visit all the required table tuples *in
physical order* using the bitmap. This eliminates multiple fetches
of the same heap page, and can possibly let you get some win from
sequential access.
Once you have built this mechanism, you can then move on to using
multiple indexes in interesting ways: you can do several indexscans
in one query and then AND or OR their bitmaps before doing the heap
scan. This would allow, for example, "WHERE a = foo and b = bar"
to be handled by ANDing results from separate indexes on the a and b
columns, rather than having to choose only one index to use as we do
now.
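The AND/OR idea can be sketched as below, with hypothetical names; a real implementation would use a sparse structure rather than flat words, as the rest of the message discusses.

```c
#include <stdint.h>
#include <stddef.h>

#define TBM_NWORDS 4    /* 256 tuple slots -- toy sized */

/* Flat bitmap of candidate tuples, one bit per TID slot */
typedef struct { uint64_t w[TBM_NWORDS]; } TidBitmap;

static void tbm_set(TidBitmap *b, unsigned bit)
{
    b->w[bit / 64] |= (uint64_t) 1 << (bit % 64);
}

static int tbm_member(const TidBitmap *b, unsigned bit)
{
    return (int) ((b->w[bit / 64] >> (bit % 64)) & 1);
}

/* WHERE a = foo AND b = bar: intersect the two index results */
static void tbm_and(TidBitmap *r, const TidBitmap *a, const TidBitmap *b)
{
    for (size_t i = 0; i < TBM_NWORDS; i++)
        r->w[i] = a->w[i] & b->w[i];
}

/* WHERE a = foo OR b = bar: union the two index results */
static void tbm_or(TidBitmap *r, const TidBitmap *a, const TidBitmap *b)
{
    for (size_t i = 0; i < TBM_NWORDS; i++)
        r->w[i] = a->w[i] | b->w[i];
}
```

The heap scan then visits only the pages whose bits survive the combination, in physical order.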
Some thoughts about implementation: FireBird's implementation seems
to depend on an assumption about a fixed number of tuple pointers
per page. We don't have that, but we could probably get away with
just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits. (It's
interesting to think of this as lossy compression of the bitmap...
which leads to the idea of only being fuzzy in limited areas of the
bitmap, rather than losing all the information you have.)
A possibly nasty issue is that lazy VACUUM has some assumptions in it
about indexscans holding pins on index pages --- that's what prevents
it from removing heap tuples that a concurrent indexscan is just about
to visit. It might be that there is no problem: even if lazy VACUUM
removes a heap tuple and someone else then installs a new tuple in that
same TID slot, you should be okay because the new tuple is too new to
pass your visibility test. But I'm not convinced this is safe.
regards, tom lane
On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
On Wed, 7 Aug 2002, Tom Lane wrote:
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.

Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down into chunks.

But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.

When the tuples are small relative to pagesize, you may get some
"compression" by saving just pages and not the actual tids in the
bitmap.
-------------
Hannu
On Wed, 7 Aug 2002, Tom Lane wrote:
I thought to myself "didn't I just post something about that?"
and then realized it was on a different mailing list. Here ya go
(and no, this is not the first time around on this list either...)
Wow. I'm glad to see you looking at this, because this feature would do
*so* much for the performance of some of my queries, and really, really
impress my "billion-row-database" client.
The idea is that you don't scan the index and base table concurrently
as we presently do it. Instead, you scan the index and make a list
of the TIDs of the table tuples you need to visit.
Right.
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.
Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down into chunks. If you want to limit to,
say, 1000 TIDs, and your index has 3000, just do the first 1000, then
the next 1000, then the last 1000. This would still result in much less
disk head movement and speed the query immensely.
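One way to sketch the chunking (names invented here): gather up to a fixed number of TIDs from the index, sort each chunk into physical block order, then fetch.

```c
#include <stdlib.h>

/* (block, offset) pair, like a heap TID */
typedef struct { unsigned block; unsigned offset; } Tid;

static int tid_cmp(const void *pa, const void *pb)
{
    const Tid *a = pa, *b = pb;
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Sort each chunk of index results into physical order before fetching,
 * so the heap reads within a chunk move the disk head mostly forward. */
static void sort_tids_in_chunks(Tid *tids, size_t n, size_t chunk)
{
    for (size_t i = 0; i < n; i += chunk) {
        size_t len = n - i < chunk ? n - i : chunk;
        qsort(tids + i, len, sizeof(Tid), tid_cmp);
    }
}
```

Larger chunks give more "sequentialness" per pass at the cost of more working memory.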
(BTW, I have verified this empirically during testing of random read vs.
random write on a RAID controller. The writes were 5-10 times faster
than the reads because the controller was caching a number of writes and
then doing them in the best possible order, whereas the reads had to be
satisfied in the order they were submitted to the controller.)
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson <cjs@cynic.net> writes:
On Wed, 7 Aug 2002, Tom Lane wrote:
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.
Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down into chunks.
But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.
Another point to keep in mind is that the bigger the bitmap gets, the
less useful an indexscan is, by definition --- sooner or later you might
as well fall back to a seqscan. So the idea of lossy compression of a
large bitmap seems really ideal to me. In principle you could seqscan
the parts of the table where matching tuples are thick on the ground,
and indexscan the parts where they ain't. Maybe this seems natural
to me as an old JPEG campaigner, but if you don't see the logic I
recommend thinking about it a little ...
regards, tom lane
On Wed, 7 Aug 2002, Tom Lane wrote:
But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.
Right. It'd be a shame to lose that, but a little is better than nothing
at all, if one ends up being faced with that decision.
Another point to keep in mind is that the bigger the bitmap gets, the
less useful an indexscan is, by definition --- sooner or later you might
as well fall back to a seqscan.
Well, yes, so long as you choose the correct values of "big." I'd want
this to be able to optimize queries against a two billion row table
about 150 GB in size. And that might even get bigger in a few years.
Maybe this seems natural
to me as an old JPEG campaigner, but if you don't see the logic I
recommend thinking about it a little ...
Well, photos are certainly not random, but database tables may be
in essentially random order far more often. How much that applies,
I'm not sure, since I don't really know a lot about this stuff.
I'll take your word for it on what's best there.
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Hannu Krosing <hannu@tm.ee> writes:
Now I remembered my original preference for page bitmaps (vs. tuple
bitmaps): one can't actually make good use of a bitmap of tuples because
there is no fixed tuples/page ratio and thus no way to quickly go from
bit position to actual tuple. You mention the same problem but propose a
different solution.
Using page bitmap, we will at least avoid fetching any unneeded pages -
essentially we will have a sequential scan over possibly interesting
pages.
Right. One form of the "lossy compression" idea I suggested is to
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
too large to work with. Again, one could imagine doing that only in
denser areas of the bitmap.
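A minimal sketch of that fallback (sizes and names invented for illustration): keep exact (page, offset) entries until a budget is exceeded, then collapse to one bit per page, meaning "scan the whole page and recheck the qual".

```c
#include <stdint.h>

enum { BM_MAX_EXACT = 8, BM_MAX_PAGES = 64 };

typedef struct {
    int      lossy;                  /* 0 = exact TIDs, 1 = per-page bits */
    unsigned n_exact;
    unsigned page[BM_MAX_EXACT];
    unsigned off[BM_MAX_EXACT];
    uint8_t  pagebits[BM_MAX_PAGES / 8];
} LossyBitmap;

static void lbm_mark_page(LossyBitmap *bm, unsigned page)
{
    bm->pagebits[page / 8] |= (uint8_t) (1u << (page % 8));
}

static void lbm_add(LossyBitmap *bm, unsigned page, unsigned off)
{
    if (!bm->lossy) {
        if (bm->n_exact < BM_MAX_EXACT) {
            bm->page[bm->n_exact] = page;
            bm->off[bm->n_exact] = off;
            bm->n_exact++;
            return;
        }
        /* budget exceeded: degrade every exact entry to a page bit */
        bm->lossy = 1;
        for (unsigned i = 0; i < bm->n_exact; i++)
            lbm_mark_page(bm, bm->page[i]);
    }
    lbm_mark_page(bm, page);
}
```

A finer-grained version could keep exact bits in sparse regions and go lossy only in the dense ones, as suggested above.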
But I guess that CLUSTER support for INSERT will not be touched for 7.3,
nor will real bitmap indexes ;)
All of this is far-future work I think. Adding a new scan type to the
executor would probably be pretty localized, but the ramifications in
the planner could be extensive --- especially if you want to do plans
involving ANDed or ORed bitmaps.
regards, tom lane
On Wed, 2002-08-07 at 06:46, Hannu Krosing wrote:
On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
On Wed, 7 Aug 2002, Tom Lane wrote:
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits.

Well, what I was thinking of, should the list of TIDs to fetch get too
long, was just to break it down into chunks.

But then you lose the possibility of combining multiple indexes through
bitmap AND/OR steps, which seems quite interesting to me. If you've
visited only a part of each index then you can't apply that concept.

When the tuples are small relative to pagesize, you may get some
"compression" by saving just pages and not the actual tids in the
bitmap.
Now I remembered my original preference for page bitmaps (vs. tuple
bitmaps): one can't actually make good use of a bitmap of tuples because
there is no fixed tuples/page ratio and thus no way to quickly go from
bit position to actual tuple. You mention the same problem but propose a
different solution.
Using page bitmap, we will at least avoid fetching any unneeded pages -
essentially we will have a sequential scan over possibly interesting
pages.
If we were to use page-bitmap index for something with only a few values
like booleans, some insert-time local clustering should be useful, so
that TRUEs and FALSEs end up on different pages.
But I guess that CLUSTER support for INSERT will not be touched for 7.3,
nor will real bitmap indexes ;)
---------------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
Right. One form of the "lossy compression" idea I suggested is to
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
too large to work with.
If it is a real bitmap, should it not be easiest to allocate at the
start?
But it isn't a "real bitmap". That would be a really poor
implementation, both for space and speed --- do you really want to scan
over a couple of megs of zeroes to find the few one-bits you care about,
in the typical case? "Bitmap" is a convenient term because it describes
the abstract behavior we want, but the actual data structure will
probably be nontrivial. If I recall Ann's description correctly,
Firebird's implementation uses run length coding of some kind (anyone
care to dig in their source and get all the details?). If we tried
anything in the way of lossy compression then there'd be even more stuff
lurking under the hood.
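For illustration, a run-length coding of a bit string could look like the sketch below: alternating run lengths, starting with a (possibly empty) run of zero-bits. This is only a guess at the flavor of the Firebird scheme, not its actual format.

```c
#include <stddef.h>

/* Encode nbits of 'bits' (LSB-first within each byte) as alternating
 * run lengths, the first run counting zero-bits.  'runs' must have room
 * for nbits + 1 entries in the worst case.  Returns the run count.
 * A sparse bitmap with a few clusters of one-bits compresses to a
 * handful of numbers. */
static size_t rle_encode(const unsigned char *bits, size_t nbits,
                         size_t *runs)
{
    size_t nruns = 0, runlen = 0;
    int cur = 0;

    for (size_t i = 0; i < nbits; i++) {
        int bit = (bits[i / 8] >> (i % 8)) & 1;
        if (bit == cur) {
            runlen++;
        } else {
            runs[nruns++] = runlen;  /* close the current run */
            cur = bit;
            runlen = 1;
        }
    }
    runs[nruns++] = runlen;          /* close the final run */
    return nruns;
}
```

Merging two such encodings (for AND/OR) can walk the run lists directly without expanding the bitmaps.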
regards, tom lane
On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
Hannu Krosing <hannu@tm.ee> writes:
Now I remembered my original preference for page bitmaps (vs. tuple
bitmaps): one can't actually make good use of a bitmap of tuples because
there is no fixed tuples/page ratio and thus no way to quickly go from
bit position to actual tuple. You mention the same problem but propose a
different solution.

Using page bitmap, we will at least avoid fetching any unneeded pages -
essentially we will have a sequential scan over possibly interesting
pages.

Right. One form of the "lossy compression" idea I suggested is to
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
too large to work with.
If it is a real bitmap, should it not be easiest to allocate at the
start?

A page bitmap for a 100,000,000-tuple table with 10 tuples/page will be
sized 10,000,000 bits / 8 = 1.25 MB, which does not look too big to me for
that amount of data (the data table itself would occupy 80 GB).
Even having a bitmap of 16 bits/page (with bits 0-14 meaning
tuples 0-14 and bit 15 meaning "seq scan the rest of the page") would
consume just 20 MB of _local_ memory, and would be quite justifiable
for a query on a table that large.
For a real bitmap index the tuples-per-page should be a user-supplied
tuning parameter.
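The arithmetic above can be checked mechanically (10 tuples per 8K page assumed, per the 80 GB figure):

```c
/* Bytes needed for a bitmap carrying 'bits_per_page' bits per heap page. */
static unsigned long long bitmap_bytes(unsigned long long ntuples,
                                       unsigned tuples_per_page,
                                       unsigned bits_per_page)
{
    unsigned long long pages = ntuples / tuples_per_page;
    return pages * bits_per_page / 8;
}
```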
Again, one could imagine doing that only in denser areas of the bitmap.
I would hardly call the resulting structure "a bitmap" ;)
And I'm not sure the overhead for a more complex structure would win us
any additional performance for most cases.
But I guess that CLUSTER support for INSERT will not be touched for 7.3,
nor will real bitmap indexes ;)

All of this is far-future work I think.
After we do that we will probably be able to claim support for
"datawarehousing" ;)
Adding a new scan type to the
executor would probably be pretty localized, but the ramifications in
the planner could be extensive --- especially if you want to do plans
involving ANDed or ORed bitmaps.
Also going to "smart inserter" which can do local clustering on sets of
real bitmap indexes for INSERTS (and INSERT side of UPDATE) would
probably be a major change from our current "stupid inserter" ;)
This will not be needed for bitmap resolution higher than 1 bit/page, but
default local clustering on bitmap indexes will probably buy us some
extra performance by avoiding data page fetches when such indexes are
used.
And anyway, support for INSERT being aware of clustering will probably
come up sometime.
------------
Hannu
On Wed, 2002-08-07 at 04:31, Curt Sampson wrote:
On Sun, 4 Aug 2002, mark Kirkwood wrote:
Ok, this change would save you the initial access of the index
structure itself - but isn't the usual killer for indexes the
"thrashing" that happens when the "pointed to" table data is spread
over many pages?

Yeah, no kidding on this one. I've reduced queries from 75 seconds
to 0.6 seconds by clustering on the appropriate field.

But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to? That would introduce a certain amount
of "sequentialness" to the reads that the OS is not capable of
introducing (since it can't reschedule the reads you're doing, the
way it could reschedule, say, random writes).
I guess this could be solved elegantly using threading - one thread
scans index and pushes tids into a btree or some other sorted structure,
while other thread loops continuously (or "elevatorly" back and forth)
over that structure in tuple order and does the actual data reads.
This would have the added benefit of better utilising multiprocessor
computers.
---------------
Hannu
On Wed, 2002-08-07 at 16:26, Tom Lane wrote:
Hannu Krosing <hannu@tm.ee> writes:
On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
Right. One form of the "lossy compression" idea I suggested is to
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
too large to work with.
If it is a real bitmap, should it not be easiest to allocate at the
start?
But it isn't a "real bitmap". That would be a really poor
implementation, both for space and speed --- do you really want to scan
over a couple of megs of zeroes to find the few one-bits you care about,
in the typical case?
I guess that depends on the data. The typical case should be something
the stats process will find out, so the optimiser can use it.
The bitmap must be less than 1/48 full (a TID is 48 bits) for an
uncompressed "active-TID-list" to be smaller than a plain bitmap. If
there were some structure above the list then this ratio would be even
higher.
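The 1/48 figure can be checked with a little arithmetic (assuming a 6-byte, i.e. 48-bit, TID as above): a plain bitmap costs one bit per tuple slot regardless of how many match, while an uncompressed TID list costs 48 bits per matching tuple, so the list only wins below 1/48 fullness.

```python
# Break-even between a plain bitmap and an uncompressed TID list.
# Assumes a 6-byte (48-bit) TID, as stated in the mail above.

TID_BITS = 6 * 8  # 48 bits per stored TID

def bitmap_bits(total_slots):
    return total_slots          # one bit per tuple slot

def tid_list_bits(matching):
    return matching * TID_BITS  # 48 bits per matching tuple

total = 480_000
# At exactly total/48 matches the two representations are the same size:
assert tid_list_bits(total // TID_BITS) == bitmap_bits(total)
# Below that fullness the TID list is smaller; above it the bitmap wins:
print(tid_list_bits(5_000) < bitmap_bits(total))   # True
print(tid_list_bits(20_000) > bitmap_bits(total))  # True
```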
I have had good experience using "compressed delta lists", which scale
well over the whole "fullness" spectrum of a bitmap, but this is for
storage, not for the initial construction of the lists.
"Bitmap" is a convenient term because it describes
the abstract behavior we want, but the actual data structure will
probably be nontrivial. If I recall Ann's description correctly,
Firebird's implementation uses run length coding of some kind (anyone
care to dig in their source and get all the details?).
Plain RLL is probably a good way to store it and for merging two or more
bitmaps, but not as good for constructing it bit-by-bit. I guess the
most effective structure for updating is often still a plain bitmap
(maybe not if it is very sparse and all of it does not fit in cache),
followed by some kind of balanced tree (maybe rb-tree).
If the bitmap is relatively full then the plain bitmap is almost always
the most effective to update.
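To make the run-length storage idea concrete (a toy sketch only; this is not Firebird's actual format): runs of identical bits are stored as (bit, length) pairs, which stays compact for both very sparse and very dense bitmaps, and two encoded bitmaps can be ORed together, which is why RLE suits storage and merging better than bit-by-bit updating.

```python
# Toy run-length encoding of a bitmap.

def rle_encode(bits):
    """Encode a list of 0/1 bits as (bit, run_length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def rle_decode(runs):
    return [b for b, n in runs for _ in range(n)]

def rle_or(a, b):
    """OR two encoded bitmaps (decode-merge-reencode, for clarity)."""
    xs, ys = rle_decode(a), rle_decode(b)
    return rle_encode([x | y for x, y in zip(xs, ys)])

sparse = [0] * 10 + [1] + [0] * 10
print(rle_encode(sparse))   # [(0, 10), (1, 1), (0, 10)] -- 3 runs for 21 bits
```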
If we tried anything in the way of lossy compression then there'd
be even more stuff lurking under the hood.
Having three-valued (0,1,maybe) RLL-encoded "tritmap" would be a good
way to represent lossy compression, and it would also be quite
straightforward to merge two of these using AND or OR. It may even be
possible to easily construct it using a fixed-length b-tree and going
from 1 to "maybe" for nodes that get too dense.
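The 0/1/maybe idea can be sketched as follows (the combination rules here are my assumption, not taken from any real implementation): each page gets one of three states, and AND/OR combine them so that a "maybe" page forces a recheck of the whole page, which is exactly the lossy-compression behaviour being discussed.

```python
# Sketch of a three-valued per-page "tritmap": 0 = no matches on the
# page, 1 = all slots match, M = maybe (page must be rechecked).

ZERO, ONE, MAYBE = "0", "1", "M"

def trit_and(a, b):
    if ZERO in (a, b):
        return ZERO              # anything AND nothing is nothing
    if a == ONE and b == ONE:
        return ONE
    return MAYBE                 # at least one side is uncertain

def trit_or(a, b):
    if ONE in (a, b):
        return ONE
    if a == ZERO and b == ZERO:
        return ZERO
    return MAYBE

pages_a = [ONE, MAYBE, ZERO, MAYBE]
pages_b = [ONE, ONE,   ONE,  ZERO]
print([trit_and(x, y) for x, y in zip(pages_a, pages_b)])  # ['1', 'M', '0', '0']
```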
---------------
Hannu
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Tom, should we be updating that flag after we CLUSTER instead of
requiring an ANALYZE after the CLUSTER?
Could do that I suppose, but I'm not super-excited about it. ANALYZE is
quite cheap these days (especially in comparison to CLUSTER ;-)). I'd
settle for a note in the CLUSTER docs that recommends a subsequent
ANALYZE --- this seems no different from recommending ANALYZE after bulk
data load or other major update of a table.
What if I [try to] extend the grammar to support an additional ANALYZE
in CLUSTER, so that it analyzes the table automatically? Say
CLUSTER <index> ON <table> [ANALYZE];
Or maybe just do an analyze of the table automatically after the
CLUSTERing.
What does everybody think?
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"To have more, one must desire less"
Alvaro Herrera <alvherre@atentus.com> writes:
What if I [try to] extend the grammar to support an additional ANALYZE
in CLUSTER, so that it analyzes the table automatically?
I don't like this -- it seems like bloat. What's the advantage of
CLUSTER foo ON bar ANALYZE;
over
CLUSTER foo ON bar;
ANALYZE;
Or maybe just do an analyze of the table automatically after the
CLUSTERing.
Hmmm... I don't really see the problem with adding a note in the docs
suggesting that users follow a CLUSTER with an ANALYZE (of course,
that assumes that the CLUSTER will significantly change the ordering
of the data in the table, which isn't always the case -- which is
another reason why making this automatic seems unwarranted, IMHO). It
seems like you're looking for a solution to a non-existent problem.
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
Neil Conway wrote:
Alvaro Herrera <alvherre@atentus.com> writes:
What if I [try to] extend the grammar to support an additional ANALYZE
in CLUSTER, so that it analyzes the table automatically?
I don't like this -- it seems like bloat.
Maybe you are right.
Or maybe just do an analyze of the table automatically after the
CLUSTERing.
Hmmm... I don't really see the problem with adding a note in the docs
suggesting that users follow a CLUSTER with an ANALYZE (...).
ANALYZE is an inexpensive operation (compared to CLUSTER, anyway), so it
can't hurt to have it done automatically.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"Linux transformed my computer from a 'machine for doing things' into
a genuinely entertaining device, about which I learn something new
every day" (Jaime Salinas)
Or maybe just do an analyze of the table automatically after the
CLUSTERing.
Hmmm... I don't really see the problem with adding a note in the docs
suggesting that users follow a CLUSTER with an ANALYZE (...).
ANALYZE is an inexpensive operation (compared to CLUSTER, anyway), so it
can't hurt to have it done automatically.
Well we have previously had discussions on the topic of adding analyze to
the end of dumps, etc. and the result has always been in favour of keeping
the command set orthogonal and not doing an automatic analyze...
Chris
Christopher Kings-Lynne wrote:
Or maybe just do an analyze of the table automatically after the
CLUSTERing.
Well we have previously had discussions on the topic of adding analyze to
the end of dumps, etc. and the result has always been in favour of keeping
the command set orthogonal and not doing an automatic analyze...
Oh. Sorry for the noise.
I'm trying to look at other things in the TODO so I stop pestering about
CLUSTER.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"To think that the spectre we see is illusory does not strip it of its
horror; it only adds the new terror of madness" (Perelandra, C.S. Lewis)
Well we have previously had discussions on the topic of adding
analyze to
the end of dumps, etc. and the result has always been in favour
of keeping
the command set orthogonal and not doing an automatic analyze...
Oh. Sorry for the noise.
I'm trying to look at other things in the TODO so I stop pestering about
CLUSTER.
All I can say is - thanks for fixing CLUSTER. As soon as we upgrade to 7.3
I'm going on a CLUSTERing spree :)
Chris
If you're looking for something very useful to work on, see if Gavin
Sherry(?) can post his old CREATE OR REPLACE VIEW code. I'm pretty sure he
(or someone) said that he had an old patch, that needed to be synced with
HEAD... This functionality is pretty essential for 7.3...
Chris
-----Original Message-----
From: Alvaro Herrera [mailto:alvherre@atentus.com]
Sent: Friday, 9 August 2002 10:21 AM
To: Christopher Kings-Lynne
Cc: Neil Conway; Tom Lane; Bruce Momjian; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] CLUSTER and indisclustered
Christopher Kings-Lynne wrote:
Or maybe just do an analyze of the table automatically after the
CLUSTERing.Well we have previously had discussions on the topic of adding
analyze to
the end of dumps, etc. and the result has always been in favour
of keeping
the command set orthogonal and not doing an automatic analyze...
Oh. Sorry for the noise.
I'm trying to look at other things in the TODO so I stop pestering about
CLUSTER.
--
Alvaro Herrera (<alvherre[a]atentus.com>)
"To think that the spectre we see is illusory does not strip it of its
horror; it only adds the new terror of madness" (Perelandra, C.S. Lewis)
Neil Conway <nconway@klamath.dyndns.org> writes:
Alvaro Herrera <alvherre@atentus.com> writes:
What if I [try to] extend the grammar to support an additional ANALYZE
in CLUSTER, so that it analyzes the table automatically?
I don't like this -- it seems like bloat.
My reaction exactly.
regards, tom lane
-----Original Message-----
From: Christopher Kings-Lynne [mailto:chriskl@familyhealth.com.au]
Sent: 09 August 2002 03:57
To: Alvaro Herrera
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] CLUSTER and indisclustered
If you're looking for something very useful to work on, see if Gavin
Sherry(?) can post his old CREATE OR REPLACE VIEW code. I'm
pretty sure he (or someone) said that he had an old patch,
that needed to be synced with HEAD... This functionality is
pretty essential for 7.3...
I'll second that...
Dave.
I wanted to comment on this bitmapped index discussion because I am
hearing a lot about star joins, data warehousing, and bitmapped indexes
recently.
It seems we have several uses for bitmapped indexes:
Do index lookups in sequential heap order
Allow joining of bitmapped indexes to construct arbitrary indexes
There is a web page about "star joins" used a lot in data warehousing,
where you don't know what queries are going to be required and what
indexes to create:
http://www.dbdomain.com/a100397.htm
They show some sample queries, which is good. Here is some
interesting text:
Star Transformation
If there are bitmap indexes on SALES_REP_ID, PRODUCT_ID, and
DEPARTMENT_ID in the SALES table, then Oracle can resolve the query
using merges of the bitmap indexes.
Because Oracle can efficiently merge multiple bitmap indexes, you can
create a single bitmap index on each of the foreign-key columns in the
fact table rather than on every possible combination of columns. This
lets you support all possible combinations of dimensions without
creating an unreasonable number of indexes.
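The quoted Oracle behaviour can be sketched like this (a toy model; sets of matching row numbers stand in for per-column bitmaps, and the column and value names are invented for illustration): one single-column bitmap index per foreign-key column, intersected at query time, covers any combination of dimension restrictions without a composite index per combination.

```python
# Toy star-transformation sketch: one "bitmap index" per fact-table
# column, represented here as {value: set_of_row_numbers}.

fact_rows = [
    {"sales_rep_id": 1, "product_id": 10, "department_id": 7},
    {"sales_rep_id": 2, "product_id": 10, "department_id": 7},
    {"sales_rep_id": 1, "product_id": 11, "department_id": 8},
    {"sales_rep_id": 1, "product_id": 10, "department_id": 8},
]

def build_bitmap_index(rows, column):
    """Map each distinct column value to the set of rows containing it."""
    index = {}
    for rownum, row in enumerate(rows):
        index.setdefault(row[column], set()).add(rownum)
    return index

rep_idx  = build_bitmap_index(fact_rows, "sales_rep_id")
prod_idx = build_bitmap_index(fact_rows, "product_id")

# WHERE sales_rep_id = 1 AND product_id = 10: AND (intersect) the bitmaps.
matches = rep_idx[1] & prod_idx[10]
print(sorted(matches))  # [0, 3]
```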
Added to TODO:
* Use bitmaps to fetch heap pages in sequential order [performance]
* Use bitmaps to combine existing indexes [performance]
and I will add some of these emails to TODO.detail/performance.
---------------------------------------------------------------------------
Tom Lane wrote:
Curt Sampson <cjs@cynic.net> writes:
But after doing some benchmarking of various sorts of random reads
and writes, it occurred to me that there might be optimizations
that could help a lot with this sort of thing. What if, when we've
got an index block with a bunch of entries, instead of doing the
reads in the order of the entries, we do them in the order of the
blocks the entries point to?
I thought to myself "didn't I just post something about that?"
and then realized it was on a different mailing list. Here ya go
(and no, this is not the first time around on this list either...)
I am currently thinking that bitmap indexes per se are not all that
interesting. What does interest me is bitmapped index lookup, which
came back into mind after hearing Ann Harrison describe how FireBird/
InterBase does it.
The idea is that you don't scan the index and base table concurrently
as we presently do it. Instead, you scan the index and make a list
of the TIDs of the table tuples you need to visit. This list can
be conveniently represented as a sparse bitmap. After you've finished
looking at the index, you visit all the required table tuples *in
physical order* using the bitmap. This eliminates multiple fetches
of the same heap page, and can possibly let you get some win from
sequential access.
Once you have built this mechanism, you can then move on to using
multiple indexes in interesting ways: you can do several indexscans
in one query and then AND or OR their bitmaps before doing the heap
scan. This would allow, for example, "WHERE a = foo and b = bar"
to be handled by ANDing results from separate indexes on the a and b
columns, rather than having to choose only one index to use as we do
now.
Some thoughts about implementation: FireBird's implementation seems
to depend on an assumption about a fixed number of tuple pointers
per page. We don't have that, but we could probably get away with
just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
Also, the main downside of this approach is that the bitmap could
get large --- but you could have some logic that causes you to fall
back to plain sequential scan if you get too many index hits. (It's
interesting to think of this as lossy compression of the bitmap...
which leads to the idea of only being fuzzy in limited areas of the
bitmap, rather than losing all the information you have.)
A possibly nasty issue is that lazy VACUUM has some assumptions in it
about indexscans holding pins on index pages --- that's what prevents
it from removing heap tuples that a concurrent indexscan is just about
to visit. It might be that there is no problem: even if lazy VACUUM
removes a heap tuple and someone else then installs a new tuple in that
same TID slot, you should be okay because the new tuple is too new to
pass your visibility test. But I'm not convinced this is safe.
regards, tom lane
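The two-phase mechanism Tom describes can be condensed into a sketch (hypothetical data structures, not PostgreSQL internals; a dict of per-page offset sets stands in for the sparse bitmap): phase one collects matching TIDs from the index scan into a per-page bitmap, phase two visits heap pages in physical order, so each page is fetched at most once.

```python
# Sketch of bitmapped index lookup: collect TIDs, then visit the heap
# in physical order.

def build_tid_bitmap(index_hits):
    """index_hits: iterable of (block, offset) TIDs from an index scan.
    Returns {block: set_of_offsets}, a sparse per-page bitmap."""
    bitmap = {}
    for block, offset in index_hits:
        bitmap.setdefault(block, set()).add(offset)
    return bitmap

def heap_visit_order(bitmap):
    """Yield (block, offset) pairs in physical heap order."""
    for block in sorted(bitmap):
        for offset in sorted(bitmap[block]):
            yield (block, offset)

hits = [(9, 2), (3, 5), (9, 1), (3, 2), (12, 4)]
print(list(heap_visit_order(build_tid_bitmap(hits))))
# [(3, 2), (3, 5), (9, 1), (9, 2), (12, 4)] -- block 9 fetched once
```

ANDing two such bitmaps for multi-index queries would amount to intersecting the block keys and then the offset sets per block.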
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Tue, 2002-08-13 at 09:25, Bruce Momjian wrote:
There is a web page about "star joins" used a lot in data warehousing,
where you don't know what queries are going to be required and what
indexes to create:
http://www.dbdomain.com/a100397.htm
They show some sample queries, which is good. Here is some
interesting text:
Star Transformation
If there are bitmap indexes on SALES_REP_ID, PRODUCT_ID, and
DEPARTMENT_ID in the SALES table, then Oracle can resolve the query
using merges of the bitmap indexes.
Because Oracle can efficiently merge multiple bitmap indexes, you can
create a single bitmap index on each of the foreign-key columns in the
fact table rather than on every possible combination of columns.
Another way to achieve a similar result would be using segmented hash
indexes, where each column maps directly to some part of the hash value.
This
lets you support all possible combinations of dimensions without
creating an unreasonable number of indexes.
-----------
Hannu