Re: [GENERAL] visualizing B-tree index coverage
Speaking of bitmapped indexes for PostgreSQL ...
Someone is working on adding bitmapped indexes into PostgreSQL:
http://new.foo-baz.com/en/index.php/archives/2004/12/06/id/25/
This also is interesting:
http://www.it.iitb.ac.in/~rvijay/dbms/proj/
-----Original Message-----
From: TJ O'Donnell [mailto:tjo@acm.org]
Sent: Wednesday, January 26, 2005 9:46 AM
To: Dann Corbit
Subject: Re: [GENERAL] visualizing B-tree index coverage
My indexed columns will never be used in joins with other tables.
They don't really represent useful data at all!
Instead, they are used only behind the scenes as an incomplete,
all-integer representation of the real string data.
Their purpose is to quickly (hence the index)
come up with a set of rows that actually needs to be searched using the
slower full search. For example, there will be many molecules that
contain >17 carbons, >4 nitrogens, >3 oxygens, >22 single bonds, etc.
but probably no more than 25%. So a slow search of 25% of the rows
is a 4x speedup (neglecting the cost of the index scan in the first
place).
My use of the index is also rarely for an exact lookup, where I'd want
to find compounds with exactly 17C, 4N, 3O, etc. atoms. Instead,
it is >= in 99% of the cases, since I am looking for structures which
contain my substructure.
It might be similar to an ordinary string search where an index is
created
not just on the initial letter of each word, but also the frequency of
vowels, consonants, number of doubled-letters, etc. within each word.
But in my case, the letters(atoms) are not always in a fixed order
within
the word(molecule).
I think my goal is to find an optimal set of columns. There will always
be
pathological cases where the indexing is not helpful. For example, if
the user searches for 'C', that means any structure containing at least
one carbon atom. That is well over 95% of the database. Indexing won't
help there. These types of searches are silly, really, and will
not ever be seen in the real world.
So, the more complex my substructure pattern is, the fewer rows I'll
have
to search....that is if my multi-column index sufficiently represents
the
complexity of the data.
TJ
Dann Corbit wrote:
It is certainly an interesting problem and it does sound a bit
atypical.
Two things are important for the indexes:
1. Join columns.
2. Where clause references.For join columns, if there is an index on all the columns in a join,
or
on the most significant columns in a join it will significantly impact
performance.For instance,
SELECT * from a, b
WHERE a.c1 = b.c1 and a.c2 = b.c2 and a.c3 = b.c3 and a.c5 = b.c5If we have an index on either of the tables on
(c5, c3, c4, c1)
Then the index will seek to c5,c3 and greatly reduce the work. Since
c4
is missing, that is the end of the value and if c4 were added to the
index it would become more effective.For a where clause, if the most significant columns in an index are
contained in the where clause, then the index will be used.For example, if I have a table a with an index on (c1, c2, c3, c4)
And I have a query:
SELECT * FROM a where c1 = 'Joe' AND c2 = 'Smith' and c4 =
'555-11-1234'
Then c1 and c2 will be used from the index to filter the search
greatly.
But this query:
SELECT * FROM a where c2 = 'Smith' and c3 = 'MS' and c4 =
'555-11-1234'
Will not use the index at all, because c1 is the most significant
column
Show quoted text
of the index and it is not used in the where clause.
-----Original Message-----
From: TJ O'Donnell [mailto:tjo@acm.org]
Sent: Tuesday, January 25, 2005 7:28 PM
To: Dann Corbit
Subject: Re: [GENERAL] visualizing B-tree index coverageI think I am using indexes differently than most people.
For example, there will NEVER be a search directly on
an indexed column (or columns). It will only be used in
conjunction with a search on the smiles/structure column
to aid in eliminating rows which could never match using the
true match function. And I will ALWAYS use
every column of the multi-column index, in "left to right"
order, when I use the index at all. If some counts are 0,
for example 0 nitrogen atoms, that is as useful a guide as
a number > 0, so my index will be something like (1,2,0,0,0,8,4,...)This seems to argue for my using one multi-column index
or reasonable (15-20?) size. Using 13 helps lots. Using
2 does not help much. Aside from that, I guess I'll have to
experiment more.Thanks for your insights, Dann.
TJ
Dann Corbit wrote:
-----Original Message-----
From: TJ [mailto:tjo@acm.org]
Sent: Tuesday, January 25, 2005 5:55 PM
To: Dann Corbit
Subject: Re: [GENERAL] visualizing B-tree index coverageIs it highly desirable to have as
few rows as possible with identical index, even if I'm not using an
index for a unique lookup (or at least very rarely doing that)?There, I cannot say. I would benchmark more than one approach.
<<I understand the great advantage of a unique index when a lookup
of one particular row is desired. Most of my searches, using my
oe_matches functions are expected to return many rows which
match, kind of like a regexp string match or even a LIKE, but
more complex for molecular structures. I am basically counting
on the index to rule out using the expensive search for as many
rows as possible, yet include all rows that would match after
an exhaustive search of the whole table.Unfortuately, I don't know which queries my user's run most often.
Think of it, again, like a regexp search. You can't tell what kinds
of words or sub-words a user might be interested in - in my case
molecular fragments or substructures.I think I should expand my multi-column index up to the limit
of 32 columns as a limiting case. This would almost surely make
a unique index for each row. Then if my timings are much improved,
I could go with that.If you have to go to 32 columns, I doubt if there will be much
benefit.
The staggering size of an index entry will be too much.
If you can add one or two columns to what you have, that would be
better, perhaps. But the extra columns will help if and only if they
are actually used in the query and if all the preceding columns arealso
used.
<<With my current 13 column index, in a 250,000 row table, there are
193 rows with an identical index. That is the maximum set of rows
with identical index. There are many other sets, most with about
10 rows having identical index. Again, is it highly desirable to have
as
few rows as possible with identical index, even if I'm not using an
index for a unique lookup (or at least very rarely doing that)?If you have to add a large number of columns, it will probably be
worse
than not doing it.
<<As for updates and deletes, these are relatively rare. The db
would grow (INSERT) by less than 1% per day. There would be virtually
no
DELETES, perhaps some UPDATES. The inserts would typically happen
overnight and could be coordinates with a re-indexing if needed.
Even dropping the index and re-creating it would be workable.
I expect 1-3 million rows would be the maximum.This is important, then.
The most important column in any index is the first column. So, for
every column that you think will receive a large volume oftransactions
it might be good to add a single column index. And if you know of
some
columns that will be used very frequently in combination in queries,
it
would be good to make multiple (but shorter) indexes on these.
If you create a large, compound index, and you do not include the
first
column of the index in your query, then the index is useless.
So (for instance) if I made an index on (protons, neutrons) and
someone
made a query:
SELECT * FROM atoms WHERE neutrons = 127
The index would not be used at all.Therefore, a number of auxiliary indexes may turn out to be a nice
benefit for performance.Probably, the statistics collector will tell you which ones are the
most
valuable over time.
So maybe, when you are done, you will have one index with 15 columns,
5
indexes with 5 columns, 8 indexes with 3 columns, and 10 indexes with
2
columns. As long as the transaction load for updates is light, a huge
number of indexes would not be too painful. But projections,
experiments and measurements are the only real way to know what isbest.
<<