LIKE indexing proposal

Started by Peter Eisentrautover 22 years ago7 messages
#1Peter Eisentraut
peter_e@gmx.net

I would like to re-table my proposal from many moons ago to allow
pattern-matching operations to use indexes under any locale.

The idea is that since pattern matching works on a character-by-character
basis, we also need to interact with the index using operators that do a
comparison strictly on a character-by-character basis. So we define a
separate set of comparison operators and operator classes for each
character type (text, varchar, bpchar, name), teach the optimizer to use
it for suitably anchored pattern-matching operations, and tell users to
create indexes using that special operator class if they want pattern
matching to use indexes.

Here are a couple of details to discuss:

I named the operators #<#, #>=#, etc. If someone can think of better
names, let me know.

Since character-by-character comparison is essentially binary comparison,
I named the operator classes, text_binary_ops, etc. Another idea is to
name them text_like_ops or text_pattern_ops or whatever, so that if some
change in the pattern matching operations would later force us to alter
the behavior of the operators away from being strictly memcmp(), we would
be free to change them.

Should there be a special case for the C locale? Without extra code, if
you use the C locale and would like to use indexes for both equality and
pattern comparisons, you need two indexes with different operator classes.
That might seem wasteful, but then all locales would really work the same.

I'm unclear on how the selectivity estimation should work. The system
doesn't collect statistics based on the new #<#-style operators, so the
estimates are based on the normal comparison, which might have little to
do with reality.

--
Peter Eisentraut peter_e@gmx.net

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: LIKE indexing proposal

Peter Eisentraut <peter_e@gmx.net> writes:

I would like to re-table my proposal from many moons ago to allow
pattern-matching operations to use indexes under any locale.

Do you have better answers to the objections that were raised the last
time?

In particular, I wonder whether it's worth putting effort into this at
all, rather than trying to make the contrib full-text-indexing support
mainstream-quality.

Since character-by-character comparison is essentially binary comparison,
I named the operator classes, text_binary_ops, etc. Another idea is to
name them text_like_ops or text_pattern_ops or whatever,

text_binary_ops seems a bit of a contradiction in terms :-(. I'd prefer
either of the other choices.

Should there be a special case for the C locale?

Yes, if only on backwards-compatibility grounds. Database setups that
worked well before should not get arbitrarily broken.

I'm unclear on how the selectivity estimation should work. The system
doesn't collect statistics based on the new #<#-style operators, so the
estimates are based on the normal comparison, which might have little to
do with reality.

Not sure that this really matters; the selectivity estimation for LIKE
is so crude that I doubt it would get significantly worse. But one
could conceive of storing statistics computed both ways in pg_statistic.

regards, tom lane

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#2)
Re: LIKE indexing proposal

Tom Lane writes:

Peter Eisentraut <peter_e@gmx.net> writes:

I would like to re-table my proposal from many moons ago to allow
pattern-matching operations to use indexes under any locale.

Do you have better answers to the objections that were raised the last
time?

The objection last time was that, according to the SQL standard, LIKE
patterns should use the locale-specific collation order for the fixed
parts (instead of character-by-character comparisons), and that under that
rule my proposed solution would fall down.

Time has shown that this objection is poor on four grounds:

1. LIKE does not work that way, and omitting optimizations now based on
possible future changes is a stupid approach.

2. If LIKE were to be changed that way, we can always take the
optimization out.

3. Even if LIKE were changed, certainly POSIX regexps would never be
changed, so you can still use the optimization there.

4. Nobody is seriously proposing to change LIKE that way. And if someone
did, I would object, because the established collation rules make getting
consistent results from this approach impossible.

--
Peter Eisentraut peter_e@gmx.net

#4Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Peter Eisentraut (#3)
Re: LIKE indexing proposal

Peter Eisentraut <peter_e@gmx.net> writes:

I would like to re-table my proposal from many moons ago to allow
pattern-matching operations to use indexes under any locale.

Wouldn't existing b-trees be sufficient, if they could be 'scanned' starting
with the operator >= ? Thus a LIKE 'ABC%' could be done by stepping an (ascending)
index fom x >= 'ABC' up to the first key that does not have 'ABC' as first
characters ?

Seems this would be of more general use, than an extra index for LIKE, no ?
Some upper bound would still be needed for a selectivity estimate, but for
estimation purposes it would not really need to be watertight.

Andreas

#5Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Zeugswetter Andreas SB SD (#4)
Re: LIKE indexing proposal

The objection last time was that, according to the SQL standard, LIKE
patterns should use the locale-specific collation order for the fixed
parts (instead of character-by-character comparisons),

Yes, I think that LIKE definitely needs to be character-by-character
and I can think of no example in the german language that would make
general sense (like ss and single byte sharp s (ß) would definitely be
nonsense if handled as equal for LIKE).

Andreas

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#4)
Re: LIKE indexing proposal

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

Wouldn't existing b-trees be sufficient, if they could be 'scanned' starting
with the operator >= ? Thus a LIKE 'ABC%' could be done by stepping an (ascending)
index fom x >= 'ABC' up to the first key that does not have 'ABC' as first
characters ?

You've apparently forgotten all our previous history on that subject
:-(. The above does not work in the presence of special sort rules for
digraphs, etc. For example, that LIKE should certainly match ABCH ...
but there are locales in which "CH" sorts after "D" and would not be
found by an indexscan that runs from ABC to ABD.

Even with no digraphs, the optimization is broken by locales that treat
spaces as second-class citizens, prefer caseless to case-sensitive
comparisons, etc. It doesn't work in en_US locale, for example.

regards, tom lane

#7Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#6)
Re: LIKE indexing proposal

You've apparently forgotten all our previous history on that subject
:-(. The above does not work in the presence of special sort rules for
digraphs, etc. For example, that LIKE should certainly match ABCH ...
but there are locales in which "CH" sorts after "D" and would not be
found by an indexscan that runs from ABC to ABD.

Sorry, so is the start condition of >= 'ABC' also not valid ?

A new index scan method with the pattern as input would need to know when it can
safely stop (which is the problem I understand), but would have the advantage of
beeing able to filter unwanted tuples before the heap access.

Andreas