An idea on faster CHAR field indexing
Hi folks,
This is my first post to your list. I've been reading it for about a week. I like the quality of the developers here and think
this bodes well for the future of Postgres.
Anyway, an idea. Not sure if RDBMSs already implement this technique internally. But in case they don't, and in case
you've never thought of it, here's something I just thought of:
CHAR fields have different sorting (aka collation) rules for each code page. E.g. the very fact that A comes before B is
something that the collation info for a given code page has to specify. Well, just because a character has a lower value
than another character in its encoding in a given code page doesn't mean it gets sorted first.
So let me cut to the chase: I'm thinking that rather than store the actual character sequence of each field (or some
subset of a field) in an index why not translate the characters into their collation sequence values and store _those_ in
the index?
The idea is to reduce the number of times a string has to be converted to its mathematical sorting-order representation.
Don't do it every time two strings get compared. Do it when a record is inserted or that field is updated.
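As a concrete (if oversimplified) sketch of what I mean, assume a single-byte code page and a hypothetical weight table
derived from its collation data (all the names here are made up):

    #include <string.h>

    /* Hypothetical: collation_weight[c] is c's position in the code
     * page's collation order, which need not match its byte value. */
    static const unsigned char collation_weight[256] = { 0 /* filled in
        from the code page's collation data */ };

    /* Build the stored index key once, at INSERT/UPDATE time. */
    static void make_index_key(const unsigned char *field, size_t len,
                               unsigned char *key)
    {
        size_t i;
        for (i = 0; i < len; i++)
            key[i] = collation_weight[field[i]];
    }

    /* Index comparisons are then a single pass, no table lookups. */
    static int key_compare(const unsigned char *a, size_t alen,
                           const unsigned char *b, size_t blen)
    {
        size_t n = alen < blen ? alen : blen;
        int c = memcmp(a, b, n);
        if (c != 0)
            return c;
        return (alen > blen) - (alen < blen);
    }

The table lookup happens once per insert or update; every later comparison is just a memcmp().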
Is this already done? Or is it not such a good idea for some reason?
I'd consider this idea of greater value in something like Unicode. For 16-bit Unicode the lookup table to find each
character's ordinal value (or sorting value, whatever it's called) is 128K, right? Doing a bunch of look-ups into that can't
be good for a processor's L1 and L2 caches.
"Randall Parker" <randall@nls.net> writes:
So let me cut to the chase: I'm thinking that rather than store the
actual character sequence of each field (or some subset of a field)
in an index why not translate the characters into their collation
sequence values and store _those_ in the index?
This is not an obvious win, since:
1. some collation rules require multiple passes over the data
2. POSIX strxfrm() will convert strings of characters to a form that
can be compared by strcmp() [i.e. single pass] but tends to greatly
increase memory requirements
I've only data for one implementation of strxfrm(), but the memory
usage startled me. In my application it was faster to use
strcoll() directly for collation than to pre-expand the data with
strxfrm().
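For concreteness, the two options look roughly like this; the sketch assumes setlocale(LC_COLLATE, ...) has already
been called, and error handling is omitted:

    #include <locale.h>
    #include <stdlib.h>
    #include <string.h>

    /* Option 1: let strcoll() apply the locale's rules on every
     * comparison. */
    int compare_direct(const char *a, const char *b)
    {
        return strcoll(a, b);
    }

    /* Option 2: pre-expand with strxfrm(); the transformed keys
     * compare with plain strcmp(), but are often much larger than
     * the inputs. */
    int compare_transformed(const char *a, const char *b)
    {
        size_t la = strxfrm(NULL, a, 0) + 1;   /* transformed sizes */
        size_t lb = strxfrm(NULL, b, 0) + 1;
        char *ka = malloc(la);
        char *kb = malloc(lb);
        int r;

        strxfrm(ka, a, la);
        strxfrm(kb, b, lb);
        r = strcmp(ka, kb);
        free(ka);
        free(kb);
        return r;
    }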
Regards,
Giles
Giles,
I'm curious as to why the need for multiple passes. Is that true even in Latin 1 code pages? If not, this optimization
could at least be used for code pages that don't require multiple passes.
As for memory usage: I don't see the issue here. The translation to some collation sequence has to be done anyhow.
Writing one's own routine to do look-ups into a collation sequence table is a fairly
trivial exercise.
One would have the option with SBCS code pages to either translate to 8-bit collation values or to translate them into
master Unicode collation values. Not sure what the advantage would be of doing the
latter. I only see it as useful if you have different rows storing text in different code pages, and then only if the RDBMS
can know, for a given field on a per-row basis, what its code page is.
"Randall Parker" <randall@nls.net> writes:
I'm curious as to why the need for multiple passes. Is that true
even in Latin 1 code pages?
Yes. Some locales want strings to be ordered first by ignoring any
accents on characters, then using a tie-break on equal strings by doing
a comparison that includes the accents.
To take another of your points out of order: this is an obstacle that
Unicode doesn't resolve. Unicode gives you a character set capable of
representing characters from many different locales, but collation
order will remain locale specific.
If not, this optimization could at least
be used for code pages that don't require multiple passes.
... but due to the increased memory/disk space, this is likely not an
optimisation. Measurements needed, I'd suggest.
My only experience of this was tuning a sort utility, where the extra
time to convert the strings with strxfrm() and the large additional
memory requirement killed any advantage strcmp() had over strcoll().
Whether this would be the case for database indexes in general or
indeed ever I don't know.
As for memory usage: I don't see the issue here. The translation to
some collation sequence has to be done anyhow.
No; you can do the comparisons in multiple passes instead without
extra storage allocation. Using multiple passes will be efficient if
the comparisons mostly don't need the second pass, which I suspect is
typical.
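A sketch of what I mean, with hypothetical primary[] and secondary[] tables standing in for real locale data:

    /* primary[] collapses accented variants onto their base letter;
     * secondary[] distinguishes only the accents.  Both tables are
     * made-up stand-ins for the locale's collation data. */
    static const unsigned char primary[256]   = { 0 /* locale data */ };
    static const unsigned char secondary[256] = { 0 /* locale data */ };

    static int pass_compare(const unsigned char *w,
                            const unsigned char *a,
                            const unsigned char *b)
    {
        for (; *a && *b; a++, b++)
            if (w[*a] != w[*b])
                return w[*a] < w[*b] ? -1 : 1;
        return (*a != '\0') - (*b != '\0');
    }

    /* No scratch storage; the second pass runs only on a primary tie. */
    int two_pass_compare(const unsigned char *a, const unsigned char *b)
    {
        int r = pass_compare(primary, a, b);   /* accents ignored */
        return r != 0 ? r : pass_compare(secondary, a, b);
    }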
Writing one's own routine to do look-ups into a collation sequence
table is a fairly trivial exercise.
True. But if you can't do character-by-character comparisons then
such a simplistic implementation will fail.
I hadn't mentioned this time around (but see the archives for the
recent discussion of LIKE) that there are locales with 2:1 and 1:2
mappings of characters too.
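To illustrate the n:m cases with hard-coded, made-up rules (traditional Spanish treats "ch" as one collation element;
German sorts the single Latin-1 character eszett like "ss"), the comparison has to read collation elements rather than
characters:

    /* Usage: struct elem_reader r = { (const unsigned char *) s, -1 };
     * then call next_element(&r) until it returns -1.  Weights and
     * rules are illustrative only; a flat char -> weight table can
     * express neither mapping. */
    struct elem_reader {
        const unsigned char *p;   /* current position in the string */
        int pending;              /* queued second element, or -1 */
    };

    #define W_CH 31               /* made-up: one element, after 'c' */
    #define W_S  70               /* made-up weight for 's' */

    /* Return the next collation element's weight, or -1 at end. */
    static int next_element(struct elem_reader *r)
    {
        if (r->pending != -1) {   /* deliver half of a 1:2 mapping */
            int w = r->pending;
            r->pending = -1;
            return w;
        }
        if (r->p[0] == '\0')
            return -1;
        if (r->p[0] == 'c' && r->p[1] == 'h') {  /* 2:1 mapping */
            r->p += 2;
            return W_CH;
        }
        if (r->p[0] == 0xDF) {    /* 1:2 mapping: Latin-1 eszett */
            r->p += 1;
            r->pending = W_S;
            return W_S;
        }
        return *r->p++;           /* elsewhere pretend weight == code */
    }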
Regards,
Giles
Giles,
On Thu, 22 Jun 2000 11:12:54 +1000, Giles Lean wrote:
Yes. Some locales want strings to be ordered first by ignoring any
accents on characters, then using a tie-break on equal strings by doing
a comparison that includes the accents.
I guess I don't see how this is really any different. Why order first by the character and second by the accent? For instance,
if you know the relative order of the various forms of "o" then just give them all successive numbers and do a single-pass
sort. You just have to make sure that all the numbers in that set of numbers are greater than the number you assign to "m"
and less than the number you assign to "p".
To take another of your points out of order: this is an obstacle that
Unicode doesn't resolve. Unicode gives you a character set capable of
representing characters from many different locales, but collation
order will remain locale specific.
With Unicode you have to have a collation order that cuts across what used to be separate character sets in separate code
pages.
... but due to the increased memory/disk space, this is likely not an
optimisation. Measurements needed, I'd suggest.
But why is there increased memory and disk space? Do the fields that go into an index not now already get stored twice?
Does the index just contain a series of references to records and that is it?
"Randall Parker" <randall@nls.net> writes:
On Thu, 22 Jun 2000 11:12:54 +1000, Giles Lean wrote:
Yes. Some locales want strings to be ordered first by ignoring any
accents on characters, then using a tie-break on equal strings by doing
a comparison that includes the accents.
I guess I don't see how this is really any different. Why order first
by the character and second by the accent? For instance, if you know
the relative order of the various forms of "o" then just give them all
successive numbers and do a single-pass sort. You just have to make
sure that all the numbers in that set of numbers are greater than the
number you assign to "m" and less than the number you assign to "p".
Nope. Would it were that easy. I don't have a keyboard that will
let me type a proper example, but consider
1. a o-with-type-1-accent c
2. a o-with-type-2-accent b
If type-1 accent sorts before type-2 then your proposal will consider
string 1 less than string 2. But the correct answer (in these locales)
is the other way round, because you mustn't look at the accents at all
unless you discover that the strings are otherwise equal. The
determining comparison here is that b < c, therefore string 2 < string 1.
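Spelling that out with made-up codes and weights (O1 and O2 standing in for the two accented o's):

    #include <stdio.h>

    enum { A, B, C, O1, O2, NCHARS };     /* stand-in character codes */

    static const int single[NCHARS]    = { 1, 2, 3, 10, 11 }; /* one weight per char */
    static const int primary[NCHARS]   = { 1, 2, 3, 10, 10 }; /* accents ignored */
    static const int secondary[NCHARS] = { 0, 0, 0,  1,  2 }; /* accents only */

    static int weighted_cmp(const int *w, const int *s, const int *t, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            if (w[s[i]] != w[t[i]])
                return w[s[i]] < w[t[i]] ? -1 : 1;
        return 0;
    }

    int main(void)
    {
        int s1[] = { A, O1, C };          /* "a o-with-type-1-accent c" */
        int s2[] = { A, O2, B };          /* "a o-with-type-2-accent b" */
        int r;

        /* Single pass decides at the accented o: s1 < s2. */
        printf("single pass: %d\n", weighted_cmp(single, s1, s2, 3));

        /* Two levels tie at the o, so b < c decides: s2 < s1. */
        r = weighted_cmp(primary, s1, s2, 3);
        if (r == 0)
            r = weighted_cmp(secondary, s1, s2, 3);
        printf("two level:   %d\n", r);
        return 0;
    }

The two schemes give opposite answers for the same pair of strings.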
regards, tom lane
Giles Lean <giles@nemeton.com.au> writes:
My only experience of this was tuning a sort utility, where the extra
time to convert the strings with strxfrm() and the large additional
memory requirement killed any advantage strcmp() had over strcoll().
Whether this would be the case for database indexes in general or
indeed ever I don't know.
Interesting. That certainly suggests strxfrm could be a loser for
a database index too, but I agree it'd be nice to see some actual
measurements rather than speculation.
What locale(s) were you using when testing your sort code? I suspect
the answers might depend on locale quite a bit...
regards, tom lane
Tom Lane writes:
Interesting. That certainly suggests strxfrm could be a loser for
a database index too, but I agree it'd be nice to see some actual
measurements rather than speculation.
What locale(s) were you using when testing your sort code? I suspect
the answers might depend on locale quite a bit...
I did a little more measurement today. It's still only anecdotal
evidence -- I wasn't terribly rigorous -- but here are my results.
My data file consisted of ~660,000 lines and a total size of ~200MB.
Each line had part descriptions in German and some uninteresting
fields. I stripped out the uninteresting fields and read the file
calling strxfrm() for each line. I recorded the total input
bytes and the total bytes returned by strxfrm().
HP-UX 11.00 de_DE.roman8 locale:
input bytes: 179647811
result bytes: 1447833496 (increase factor 8.05)
Solaris 2.6 de_CH locale:
input bytes: 179647811
result bytes: 1085875122 (increase factor 6.04)
I didn't time the test program on Solaris, but on HP-UX this program
took longer to run than a simplistic qsort() using strcoll() does, and
my comparison sort program has to write the data out as well, which
the strxfrm()-calling program didn't do.
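For the record, the measuring program was essentially this shape (a sketch, not the exact code; the locale comes from
the environment):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    /* Read lines on stdin, transform each with strxfrm(), and tally
     * input bytes against transformed bytes. */
    int main(void)
    {
        char line[4096];
        unsigned long in_bytes = 0, out_bytes = 0;

        setlocale(LC_COLLATE, "");        /* e.g. de_DE.roman8, de_CH */

        while (fgets(line, sizeof line, stdin) != NULL) {
            line[strcspn(line, "\n")] = '\0';
            in_bytes  += strlen(line);
            out_bytes += strxfrm(NULL, line, 0);  /* size, not a copy */
        }
        printf("input bytes:  %lu\n", in_bytes);
        if (in_bytes > 0)
            printf("result bytes: %lu (increase factor %.2f)\n",
                   out_bytes, (double) out_bytes / (double) in_bytes);
        return 0;
    }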
Regards,
Giles
[ I hope this is useful for the archives, even though it's pretty much
been covered already. --giles ]
"Randall Parker" <randall@nls.net> writes:
With Unicode you have to have a collation order that cuts across
what used to be separate character sets in separate code pages.
From the glossary at http://www.unicode.org/glossary/index.html:
Collation
The process of ordering units of textual information. Collation is
usually specific to a particular language. Also known as
alphabetizing or alphabetic sorting. Unicode Technical Report #10,
"Unicode Collation Algorithm," defines a complete, unambiguous,
specified ordering for all characters in the Unicode Standard.
The "Unicode Technical Report #10" is accessible at:
http://www.unicode.org/unicode/reports/tr10/
This technical report provides a way to specify "the tailoring for any
particular country" which together with the standard ordering will
allow different Unicode implementations to sort any particular input
identically.
Summary is that Unicode still has locales, and we still have to know
not only the character set/code page but also the locale ("country
specific tailoring") an index was created with in order to use it.
Regards,
Giles