LIKE indexing
Here's the patch for review. It adds the non-locale operator classes for
all three character types and the analogous selectivity estimation
changes. Basically, I'm confident this works, but as some people seem to
have doubts I show it here first.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Attachments:
I'm fairly close to committing a wholesale rearrangement of pg_opclass
and friends (per previous discussions, mostly with Oleg). This is going
to create some conflicts with your patch :-(. Who gets to go first?
regards, tom lane
Tom Lane writes:
I'm fairly close to committing a wholesale rearrangement of pg_opclass
and friends (per previous discussions, mostly with Oleg). This is going
to create some conflicts with your patch :-(. Who gets to go first?
If your changes only conflict in the system catalog headers I can redo
those when you're done. (Presuming there's documentation coming along.)
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes:
If your changes only conflict in the system catalog headers I can redo
those when you're done. (Presuming there's documentation coming along.)
Actually, all those uses of op_class() are going to conflict too...
op_class doesn't need an AM OID parameter anymore (and besides which,
I renamed it to op_in_opclass).
If you feel ready to commit, do so, and I'll do the merge.
regards, tom lane
Peter Eisentraut <peter_e@gmx.net> writes:
Here's the patch for review.
A few gripes:
+ The optimizer can also use a B-Tree index for queries involving the
+ pattern matching operators <literal>LIKE</>,
+ <literal>ILIKE</literal>, <literal>~</literal>, and
+ <literal>~*</literal>, <emphasis>if</emphasis> the pattern is
+ anchored to the beginning of the string, e.g., <literal>col LIKE
+ 'foo%'</literal> or <literal>col ~ '^foo'</literal>, but not
+ <literal>col LIKE 'bar'</literal>. However, if your server does
The "but not" part is wrong: col LIKE 'bar' works perfectly fine as
an indexable LIKE query. Perhaps you meant "but not col LIKE '%foo'".
While it's okay to treat text and varchar alike, I object to treating
bpchar as equivalent to the other two. Shouldn't the bpchar versions of
these functions strip trailing spaces before comparing?
Seems to me you should provide "$<>$" operators for completeness, even
though they're not essential for btree opclasses. I think that these
operators may be useful for more than just this one purpose, so we
shouldn't set up artificial roadblocks.
I don't like the fact that you added expected-output rows to opr_sanity;
seems like tweaking the queries to allow $<$ etc as expected names would
be more appropriate.
regards, tom lane
Tom Lane writes:
The "but not" part is wrong: col LIKE 'bar' works perfectly fine as
an indexable LIKE query. Perhaps you meant "but not col LIKE '%foo'".
Thanks. That was a mixup with the POSIX regexp style.
While it's okay to treat text and varchar alike, I object to treating
bpchar as equivalent to the other two. Shouldn't the bpchar versions of
these functions strip trailing spaces before comparing?
I had thought a long time about this and I couldn't see a reason why.
The reason is that the LIKE operator for bpchar does take the blanks into
account, so it effectively doesn't care whether the blanks are the result
of padding or explicit input. E.g.,
peter=# set enable_indexscan to off;
SET VARIABLE
peter=# create table test1 (a char(5));
CREATE
peter=# insert into test1 values ('four');
INSERT 16560 1
peter=# select * from test1 where a like 'four'::bpchar;
a
---
(0 rows)
/*
* If we had stripped spaces here we would have gotten a false positive.
*/
peter=# select * from test1 where a like 'fou_'::bpchar;
a
---
(0 rows)
/*
* Since the padding here is after the wildcard character and is thus
* stripped in the analysis, the augmented expression still holds.
*/
peter=# select * from test1 where a like 'fou%'::bpchar;
a
-------
four
(1 row)
/* same here */
I would also argue that the notion of a direct binary comparision would
not benefit from space stripping.
Seems to me you should provide "$<>$" operators for completeness, even
though they're not essential for btree opclasses.
Will do.
I don't like the fact that you added expected-output rows to opr_sanity;
seems like tweaking the queries to allow $<$ etc as expected names would
be more appropriate.
Ok.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes:
peter=# create table test1 (a char(5));
CREATE
peter=# insert into test1 values ('four');
INSERT 16560 1
peter=# select * from test1 where a like 'four'::bpchar;
a
---
(0 rows)
I think this is an erroneous result, actually, seeing as how
regression=# select 'four '::bpchar = 'four'::bpchar;
?column?
----------
t
(1 row)
How can A = B not imply A LIKE B? (This may be related to Hiroshi's
concerns.)
I dug into the spec to see what it has to say, and came up with this
rather opaque prose:
4) If the i-th substring specifier of PCV is neither an
arbitrary character specifier nor an arbitrary string
specifier, then the i-th substring of MCV is equal to
that substring specifier according to the collating
sequence of the <like predicate>, without the appending
of <space> characters to MCV, and has the same length as
that substring specifier.
The bit about "without the appending of <space> characters" *might*
mean that LIKE is always supposed to treat trailing blanks as
significant, but I'm not sure. The text does seem to say that it's okay
to add trailing blanks to the pattern to produce a match, when the
collating sequence is PAD SPACE type (bpchar in our terms).
In any case, Hiroshi is dead right that LIKE is supposed to perform
collating-sequence-dependent comparison, and this probably means that
this whole approach is a dead end :-(
regards, tom lane
Tom Lane writes:
How can A = B not imply A LIKE B?
Well, according to my reading of the spec, it apparently can. Space
padding can be weird that way. But see below why I think there are much
worse alternatives.
4) If the i-th substring specifier of PCV is neither an
arbitrary character specifier nor an arbitrary string
specifier, then the i-th substring of MCV is equal to
that substring specifier according to the collating
sequence of the <like predicate>, without the appending
of <space> characters to MCV, and has the same length as
that substring specifier.The bit about "without the appending of <space> characters" *might*
mean that LIKE is always supposed to treat trailing blanks as
significant, but I'm not sure.
That's how I read it.
The text does seem to say that it's okay to add trailing blanks to the
pattern to produce a match, when the collating sequence is PAD SPACE
type (bpchar in our terms).
I can't find that.
In any case, Hiroshi is dead right that LIKE is supposed to perform
collating-sequence-dependent comparison,
As I have answered to Hiroshi, I think that would really be brain-dead.
It would alienate LIKE from how pattern matching normally operates. If we
make the assumption that strcoll(A, B) can be 0 for wildly different
values of A and B (for an appropriate definition of "different"), then the
following things could happen:
-> A = B does not imply A ~ B
-> A LIKE 'foobar%' does not imply A LIKE 'foo%' (because 'foobar' is a
single collating element that sorts like 'xyz').
-> A LIKE '%foo%' does not imply that POSITION('foo' IN A) <> 0 (The SQL
POSITION function does not mention using the collating sequence.)
I'm also quite suspicious about the wording "...and has the same length as
that substring specifier." For instance, it might be nearly reasonable to
define a German locale where � (u umlaut) and ue are equivalent. But then
while 'x�y' = 'xuey' (a strict interpretation of the SQL standard might
deny this because of the padding, but "The result of the comparison of X
and Y is given by the collating sequence CS.", and I define mine that
way), but 'x�y' NOT LIKE 'xuey' because of that rule. Voil�, it can
happen after all.
I think this rule is a mistake designed by committee and must be struck
down by community. ;-)
and this probably means that this whole approach is a dead end :-(
Blech... ;-)
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane writes:
The text does seem to say that it's okay to add trailing blanks to the
pattern to produce a match, when the collating sequence is PAD SPACE
type (bpchar in our terms).
I can't find that.
The spec says "without the appending of <space> characters to MCV",
not "without the appending of <space> characters" full stop. I read
that to imply that it *is* okay to append spaces to the pattern PCV.
But it's not exactly transparently written in any case.
I'm also quite suspicious about the wording "...and has the same length as
that substring specifier."
Yeah, me too. Does that say that space padding isn't allowed? If so,
why the thrashing-about earlier in the sentence? In any case, it seems
to allow a locale-dependent case conversion, for instance.
In any case, Hiroshi is dead right that LIKE is supposed to perform
collating-sequence-dependent comparison,
As I have answered to Hiroshi, I think that would really be brain-dead.
It would alienate LIKE from how pattern matching normally operates.
Well, you make some good arguments, but I'd like to see a fairly strong
consensus that we think the SQL definition of LIKE is broken before we
decide to build a lot of superstructure on our definition of LIKE. So
far I haven't seen many comments on this thread...
regards, tom lane