Boyer-More-Horspool searching LIKE queries

Started by Atsushi Ogawaabout 4 years ago3 messages

atsushi.ogawa@gmail.com

about 4 years ago

1 attachment(s)

I have created a patch to enable the Boyer-More-Horspool search
algorithm (B-M-H) for LIKE queries.

B-M-H needs to initialize the skip table and keep it during SQL execution.
In this patch, flinfo->fn_extra is used to keep the skip table.

The conditions under which B-M-H can be used are as follows.

(1) B-M-H in LIKE search supports only single-byte character sets and UTF8.
Multibyte character sets does not support, because it may contain another
characters in the byte sequence. For UTF-8, it works fine, because in
UTF-8 the byte sequence of one character cannot contain another character.

(2) The pattern string should be stable parameter, because B-M-H needs to
keep
the skip table generated from the pattern string during the execution of
the query.

(3) The pattern string should be at least 4 characters.
For example, '%AB%' can use B-M-H.

(4) The first and last character of the pattern string should be '%'.

(5) Characters other than the first and last of the pattern string
should not be '%', '_'. However, escaped characters such as
'\%', '\_' are available.

Also, this patch changes the collation validity check in functions
(such as textlike) to be performed at the first execution of the query,
instead of each function execution.
I have measured the performance with the following query.

---------- ---------- ---------- ---------- ---------- ---------- ----------
SET client_min_messages TO notice;

\timing

DO $$
DECLARE
cnt integer := 0;
total integer := 0;
BEGIN
FOR i IN 1..500 LOOP
select count(*) into cnt
from pg_catalog.pg_description d
where d.description like '%greater%';

total := total + cnt;
END LOOP;

RAISE NOTICE 'TOTAL: %', total;
END
$$
;
---------- ---------- ---------- ---------- ---------- ---------- ----------

Result
Without patch: 257.504ms
With patch: 191.638ms

Regards,
Atsushi Ogawa

Attachments:

like_bmh.patchapplication/octet-stream; name=like_bmh.patchDownload

diff --git a/src/backend/utils/adt/like.c b/src/backend/utils/adt/like.c
index 9f241dc7c6..5a7f510ec5 100644
--- a/src/backend/utils/adt/like.c
+++ b/src/backend/utils/adt/like.c
@@ -45,7 +45,7 @@ static int	UTF8_MatchText(const char *t, int tlen, const char *p, int plen,
 static int	SB_IMatchText(const char *t, int tlen, const char *p, int plen,
 						  pg_locale_t locale, bool locale_is_c);
 
-static int	GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation);
+static int	GenericMatchText(const char *s, int slen, const char *p, int plen);
 static int	Generic_Text_IC_like(text *str, text *pat, Oid collation);
 
 /*--------------------
@@ -146,9 +146,9 @@ SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
 
 #include "like_match.c"
 
-/* Generic for all cases not requiring inline case-folding */
-static inline int
-GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
+/* Check collation for like search */
+static void
+CheckCollationForLike(Oid collation)
 {
 	if (collation && !lc_ctype_is_c(collation) && collation != DEFAULT_COLLATION_OID)
 	{
@@ -159,7 +159,12 @@ GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("nondeterministic collations are not supported for LIKE")));
 	}
+}
 
+/* Generic for all cases not requiring inline case-folding */
+static inline int
+GenericMatchText(const char *s, int slen, const char *p, int plen)
+{
 	if (pg_database_encoding_max_length() == 1)
 		return SB_MatchText(s, slen, p, plen, 0, true);
 	else if (GetDatabaseEncoding() == PG_UTF8)
@@ -234,6 +239,157 @@ Generic_Text_IC_like(text *str, text *pat, Oid collation)
 	}
 }
 
+/*--------------------
+ * Support routines for Boyer-Moore-Horspool(B-M-H) search algorithm.
+ *--------------------
+ */
+#define BMH_SKIP_TABLE_SIZE	256
+typedef struct {
+	bool	is_bmh_available;
+	int		skip_table[BMH_SKIP_TABLE_SIZE];
+	int		pat_len;
+	char	pat[FLEXIBLE_ARRAY_MEMBER];
+} BMHState;
+
+/*
+ * Check whether B-M-H in LIKE search is available for the pattern string.
+ * B-M-H in LIKE search is available if pattern string such as
+ * ... LIKE '%WORD%' are specified.
+ */
+static bool CheckBMHAvailable(const char *p, int plen, FmgrInfo *flinfo)
+{
+	int		i;
+
+	/*
+	 * B-M-H in LIKE search supports only single byte character set and UTF8.
+	 * Multibyte character set does not support, because it may contain another
+	 * character in the byte sequence. For UTF-8, that works fine, because in
+	 * UTF-8 the byte sequence of one character cannot contain another character.
+	 */
+	if(pg_database_encoding_max_length() != 1 &&
+		GetDatabaseEncoding() != PG_UTF8) return false;
+
+	/*
+	 * The pattern string should be stable param, because B-M-H needs to keep
+	 * the skip table generated from the pattern string during the execution
+	 * of the query.
+	 */
+	if(!get_fn_expr_arg_stable(flinfo, 1)) return false;
+
+	/*
+	 * The pattern string should be at least 4 characters.
+	 * For example, '%AB%' can use B-M-H.
+	 */
+	if(plen < 4) return false;
+
+	/*
+	 * The first and last character of the pattern string should be '%'.
+	 */
+	if(p[0] != '%' || p[plen - 1] != '%') return false;
+
+	/*
+	 * Characters other than the first and last of the pattern string
+	 * should not be '%', '_'.  However, escaped characters such as
+	 * '\%', '\_' are available for B-M-H.
+	 */
+	for(i = 1; i < plen - 1; i++) {
+		if(p[i] == '%' || p[i] == '_') return false;
+		if(p[i] == '\\') {
+			i++;
+		}
+	}
+
+	/*
+	 * B-M-H is available for the pattern string.
+	 */
+	return true;
+}
+
+/*
+ * Initialize data for B-M-H.
+ */
+static void InitBMHState(const char *p, int plen, FmgrInfo *flinfo)
+{
+	BMHState   *state;
+	bool		is_bmh_available;
+	int			i;
+	int			alloc_size;
+
+	is_bmh_available = CheckBMHAvailable(p, plen, flinfo);
+
+	if(is_bmh_available) {
+		alloc_size = sizeof(BMHState) + (plen * sizeof(char));
+		state = (BMHState *)MemoryContextAlloc(flinfo->fn_mcxt, alloc_size);
+		state->is_bmh_available = is_bmh_available;
+
+		/*
+		 * Copy the pattern string, removing the first and last '%'.
+		 * And also removes the escape character.
+		 * For example, "%A\_B%" becomes "A_B".
+		 */
+		state->pat_len = 0;
+		for(i = 1; i < plen - 1; i++) {
+			if(p[i] == '\\') {
+				i++;
+			}
+			state->pat[state->pat_len] = p[i];
+			state->pat_len++;
+		}
+		state->pat[state->pat_len] = '\0';
+
+		/*
+		 * Initialyze skip table for B-M-H.
+		 */
+		for(i = 0; i < BMH_SKIP_TABLE_SIZE; i++) {
+			state->skip_table[i] = state->pat_len;
+		}
+		for(i = 0; i < state->pat_len - 1; i++) {
+			unsigned char c = (unsigned char)state->pat[i];
+			state->skip_table[c] = state->pat_len - i - 1;
+		}
+	} else {
+		/*
+		 * Mark that B-M-H is unavailable.
+		 */
+		alloc_size = sizeof(BMHState);
+		state = (BMHState *)MemoryContextAlloc(flinfo->fn_mcxt, alloc_size);
+		state->is_bmh_available = is_bmh_available;
+	}
+
+	flinfo->fn_extra = state;
+}
+
+/*
+ * Perform a LIKE search using B-M-H.
+ */
+static int BMHSearch(const char *str, int str_len, const BMHState *state)
+{
+	const int  *skip_table = state->skip_table;
+	const char *pat = state->pat;
+	const char *p_last = state->pat + state->pat_len - 1;
+	int			j = state->pat_len - 1;
+
+	/*
+	 * B-M-H checks the string match from the end of the pattern string.
+	 * For example, the pattern string is "ABC", check in the order of 'C', 'B', 'A'.
+	 */
+	while(j <= (str_len - 1)) {
+		const char	   *p = p_last;
+		const char	   *s = str + j;
+		unsigned char	c = (unsigned char)*s;
+
+		while(*p == *s) {
+			if(p == pat) return LIKE_TRUE;
+			p--;
+			s--;
+		}
+
+		j += skip_table[c];
+	}
+
+	return LIKE_FALSE;
+}
+
 /*
  *	interface routines called by the function manager
  */
@@ -248,13 +404,27 @@ namelike(PG_FUNCTION_ARGS)
 			   *p;
 	int			slen,
 				plen;
+	FmgrInfo   *flinfo;
+	BMHState   *state;
 
 	s = NameStr(*str);
 	slen = strlen(s);
 	p = VARDATA_ANY(pat);
 	plen = VARSIZE_ANY_EXHDR(pat);
 
-	result = (GenericMatchText(s, slen, p, plen, PG_GET_COLLATION()) == LIKE_TRUE);
+	flinfo = fcinfo->flinfo;
+	if(flinfo->fn_extra == NULL) {
+		CheckCollationForLike(PG_GET_COLLATION());
+		InitBMHState(p, plen, flinfo);
+	}
+
+	state = (BMHState *)flinfo->fn_extra;
+
+	if(state->is_bmh_available) {
+		result = (BMHSearch(s, slen, state) == LIKE_TRUE);
+	} else {
+		result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);
+	}
 
 	PG_RETURN_BOOL(result);
 }
@@ -269,13 +439,27 @@ namenlike(PG_FUNCTION_ARGS)
 			   *p;
 	int			slen,
 				plen;
+	FmgrInfo   *flinfo;
+	BMHState   *state;
 
 	s = NameStr(*str);
 	slen = strlen(s);
 	p = VARDATA_ANY(pat);
 	plen = VARSIZE_ANY_EXHDR(pat);
 
-	result = (GenericMatchText(s, slen, p, plen, PG_GET_COLLATION()) != LIKE_TRUE);
+	flinfo = fcinfo->flinfo;
+	if(flinfo->fn_extra == NULL) {
+		CheckCollationForLike(PG_GET_COLLATION());
+		InitBMHState(p, plen, flinfo);
+	}
+
+	state = (BMHState *)flinfo->fn_extra;
+
+	if(state->is_bmh_available) {
+		result = (BMHSearch(s, slen, state) != LIKE_TRUE);
+	} else {
+		result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);
+	}
 
 	PG_RETURN_BOOL(result);
 }
@@ -290,13 +474,27 @@ textlike(PG_FUNCTION_ARGS)
 			   *p;
 	int			slen,
 				plen;
+	FmgrInfo   *flinfo;
+	BMHState   *state;
 
 	s = VARDATA_ANY(str);
 	slen = VARSIZE_ANY_EXHDR(str);
 	p = VARDATA_ANY(pat);
 	plen = VARSIZE_ANY_EXHDR(pat);
 
-	result = (GenericMatchText(s, slen, p, plen, PG_GET_COLLATION()) == LIKE_TRUE);
+	flinfo = fcinfo->flinfo;
+	if(flinfo->fn_extra == NULL) {
+		CheckCollationForLike(PG_GET_COLLATION());
+		InitBMHState(p, plen, flinfo);
+	}
+
+	state = (BMHState *)flinfo->fn_extra;
+
+	if(state->is_bmh_available) {
+		result = (BMHSearch(s, slen, state) == LIKE_TRUE);
+	} else {
+		result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);
+	}
 
 	PG_RETURN_BOOL(result);
 }
@@ -311,13 +509,27 @@ textnlike(PG_FUNCTION_ARGS)
 			   *p;
 	int			slen,
 				plen;
+	FmgrInfo   *flinfo;
+	BMHState   *state;
 
 	s = VARDATA_ANY(str);
 	slen = VARSIZE_ANY_EXHDR(str);
 	p = VARDATA_ANY(pat);
 	plen = VARSIZE_ANY_EXHDR(pat);
 
-	result = (GenericMatchText(s, slen, p, plen, PG_GET_COLLATION()) != LIKE_TRUE);
+	flinfo = fcinfo->flinfo;
+	if(flinfo->fn_extra == NULL) {
+		CheckCollationForLike(PG_GET_COLLATION());
+		InitBMHState(p, plen, flinfo);
+	}
+
+	state = (BMHState *)flinfo->fn_extra;
+
+	if(state->is_bmh_available) {
+		result = (BMHSearch(s, slen, state) != LIKE_TRUE);
+	} else {
+		result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);
+	}
 
 	PG_RETURN_BOOL(result);
 }
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 0f95b9400b..8360d9577c 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -1645,6 +1645,67 @@ SELECT 'be_r' NOT LIKE '__e__r' ESCAPE '_' AS "true";
  t
 (1 row)
 
+-- boyer-moore-horspool
+SELECT 'abcd' LIKE '%ab%' as "true";
+ true 
+------
+ t
+(1 row)
+
+SELECT 'abcd' LIKE '%bc%' as "true";
+ true 
+------
+ t
+(1 row)
+
+SELECT 'abcd' LIKE '%cd%' as "true";
+ true 
+------
+ t
+(1 row)
+
+SELECT 'a%cd' LIKE '%a\%c%' as "true";
+ true 
+------
+ t
+(1 row)
+
+SELECT 'a_cd' LIKE '%a\_c%' as "true";
+ true 
+------
+ t
+(1 row)
+
+SELECT 'abcd' NOT LIKE '%ab%' as "false";
+ false 
+-------
+ f
+(1 row)
+
+SELECT 'abcd' NOT LIKE '%bc%' as "false";
+ false 
+-------
+ f
+(1 row)
+
+SELECT 'abcd' NOT LIKE '%cd%' as "false";
+ false 
+-------
+ f
+(1 row)
+
+SELECT 'a%cd' NOT LIKE '%a\%c%' as "false";
+ false 
+-------
+ f
+(1 row)
+
+SELECT 'a_cd' NOT LIKE '%a\_c%' as "false";
+ false 
+-------
+ f
+(1 row)
+
 --
 -- test ILIKE (case-insensitive LIKE)
 -- Be sure to form every test as an ILIKE/NOT ILIKE pair.
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index 8c379182cb..8983690fd4 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -452,6 +452,18 @@ SELECT 'be_r' NOT LIKE 'b_e__r' ESCAPE '_' AS "false";
 SELECT 'be_r' LIKE '__e__r' ESCAPE '_' AS "false";
 SELECT 'be_r' NOT LIKE '__e__r' ESCAPE '_' AS "true";
 
+-- boyer-moore-horspool
+SELECT 'abcd' LIKE '%ab%' as "true";
+SELECT 'abcd' LIKE '%bc%' as "true";
+SELECT 'abcd' LIKE '%cd%' as "true";
+SELECT 'a%cd' LIKE '%a\%c%' as "true";
+SELECT 'a_cd' LIKE '%a\_c%' as "true";
+
+SELECT 'abcd' NOT LIKE '%ab%' as "false";
+SELECT 'abcd' NOT LIKE '%bc%' as "false";
+SELECT 'abcd' NOT LIKE '%cd%' as "false";
+SELECT 'a%cd' NOT LIKE '%a\%c%' as "false";
+SELECT 'a_cd' NOT LIKE '%a\_c%' as "false";
 
 --
 -- test ILIKE (case-insensitive LIKE)

Heikki Linnakangas

hlinnaka@iki.fi

about 4 years ago

In reply to: Atsushi Ogawa (#1)

Re: Boyer-More-Horspool searching LIKE queries

On 11/01/2022 15:55, Atsushi Ogawa wrote:

I have created a patch to enable the Boyer-More-Horspool search
algorithm (B-M-H) for LIKE queries.

Cool!

The conditions under which B-M-H can be used are as follows.

(1) B-M-H in LIKE search supports only single-byte character sets and UTF8.
Multibyte character sets does not support, because it may contain another
characters in the byte sequence. For UTF-8, it works fine, because in
UTF-8 the byte sequence of one character cannot contain another character.

You can make it work with any encoding, if you check for that case after
you find a match. See how text_position() does it.

(3) The pattern string should be at least 4 characters.
For example, '%AB%' can use B-M-H.

To be precise, the patch checks that the pattern string is at least 4
*bytes* long. A pattern like E'%\U0001F418%' would benefit too.

If I'm reading the code correctly, it doesn't account for escapes
correctly. It will use B-M-H for a pattern like '%\\%', even though
that's just searching for a single backslash and won't benefit from B-M-H.

(4) The first and last character of the pattern string should be '%'.

I wonder if we can do better than that. If you have a pattern like
'%foo%bar', its pretty obvious (to a human) that you can quickly check
if the string ends in 'bar', and then check if it also contains the
substring 'foo'. Is there some way to generalize that?

Looking at MatchText() in like.c, there is this piece of code:

else if (*p == '%')
{
char firstpat;

/*
* % processing is essentially a search for a text position at
* which the remainder of the text matches the remainder of the
* pattern, using a recursive call to check each potential match.
*
* If there are wildcards immediately following the %, we can skip
* over them first, using the idea that any sequence of N _'s and
* one or more %'s is equivalent to N _'s and one % (ie, it will
* match any sequence of at least N text characters). In this way
* we will always run the recursive search loop using a pattern
* fragment that begins with a literal character-to-match, thereby
* not recursing more than we have to.
*/
NextByte(p, plen);

while (plen > 0)
{
if (*p == '%')
NextByte(p, plen);
else if (*p == '_')
{
/* If not enough text left to match the pattern, ABORT */
if (tlen <= 0)
return LIKE_ABORT;
NextChar(t, tlen);
NextByte(p, plen);
}
else
break; /* Reached a non-wildcard pattern char */
}

Could we use B-M-H to replace that piece of code?

How does the performance compare with regular expressions? Would it be
possible to use this for trivial regular expressions too? Or could we
speed up the regexp engine to take advantage of B-M-H, and use it for
LIKE? Or something like that?

I have measured the performance with the following query.

Setting up the B-M-H table adds some initialization overhead, so this
would be a loss for cases where the LIKE is executed only once, and/or
the haystack strings are very small. That's probably OK, the overhead is
probably small, and those cases are probably not performance-critical.
But would be nice to measure that too.

- Heikki

Atsushi Ogawa

atsushi.ogawa@gmail.com

almost 4 years ago

In reply to: Heikki Linnakangas (#2)

Re: Boyer-More-Horspool searching LIKE queries

Thanks for the comments.

The conditions under which B-M-H can be used are as follows.

(1) B-M-H in LIKE search supports only single-byte character sets and

UTF8.

Multibyte character sets does not support, because it may contain

another

characters in the byte sequence. For UTF-8, it works fine, because in
UTF-8 the byte sequence of one character cannot contain another

character.

You can make it work with any encoding, if you check for that case after
you find a match. See how text_position() does it.

I saw the text_position(). I would like to do the same.

(3) The pattern string should be at least 4 characters.
For example, '%AB%' can use B-M-H.

To be precise, the patch checks that the pattern string is at least 4
*bytes* long. A pattern like E'%\U0001F418%' would benefit too.

*bytes* is precise. I will revise the comment of code.

If I'm reading the code correctly, it doesn't account for escapes
correctly. It will use B-M-H for a pattern like '%\\%', even though
that's just searching for a single backslash and won't benefit from B-M-H.

You are correct. I will fix it.

(4) The first and last character of the pattern string should be '%'.

I wonder if we can do better than that. If you have a pattern like
'%foo%bar', its pretty obvious (to a human) that you can quickly check
if the string ends in 'bar', and then check if it also contains the
substring 'foo'. Is there some way to generalize that?

I think the following optimizations are possible.

(1)%foo%bar
Check if the string ends with 'bar' and search for 'foo' by B-M-H.

(2)foo%bar%
Check if the string starts with 'foo' and search for 'bar' by B-M-H.

(3)foo%bar%baz
Check if the string starts with 'foo' and string ends with 'baz' and
search for 'bar' by B-M-H.

Looking at MatchText() in like.c, there is this piece of code:

else if (*p == '%')
{
char firstpat;

/*
* % processing is essentially a search for a

text position at

* which the remainder of the text matches the

remainder of the

* pattern, using a recursive call to check each

potential match.

*
* If there are wildcards immediately following

the %, we can skip

* over them first, using the idea that any

sequence of N _'s and

* one or more %'s is equivalent to N _'s and one

% (ie, it will

* match any sequence of at least N text

characters). In this way

* we will always run the recursive search loop

using a pattern

* fragment that begins with a literal

character-to-match, thereby

* not recursing more than we have to.
*/
NextByte(p, plen);

while (plen > 0)
{
if (*p == '%')
NextByte(p, plen);
else if (*p == '_')
{
/* If not enough text left to

match the pattern, ABORT */

if (tlen <= 0)
return LIKE_ABORT;
NextChar(t, tlen);
NextByte(p, plen);
}
else
break; /* Reached a

non-wildcard pattern char */

}

Could we use B-M-H to replace that piece of code?

For example, in a pattern such as %foo%bar%, it is possible to first search
for 'foo' by B-M-H, and then search for 'bar' by B-M-H. It would be nice if
such a
process could be generalized to handle various LIKE search patterns.

How does the performance compare with regular expressions? Would it be
possible to use this for trivial regular expressions too? Or could we
speed up the regexp engine to take advantage of B-M-H, and use it for
LIKE? Or something like that?

I think regular expressions in postgresql is slower than LIKE.
I compared it with the following two SQLs.

(1)LIKE: execution time is about 0.8msec
select count(*) from pg_catalog.pg_description d where d.description like
'%greater%';

(2)regular expression: execution time is about 3.1 msec
select count(*) from pg_catalog.pg_description d where d.description ~
'greater';

For trivial regular expressions, it may be better to use LIKE.

I have measured the performance with the following query.

Setting up the B-M-H table adds some initialization overhead, so this
would be a loss for cases where the LIKE is executed only once, and/or
the haystack strings are very small. That's probably OK, the overhead is
probably small, and those cases are probably not performance-critical.
But would be nice to measure that too.

I tried to measure the case where LIKE is executed only once and
the haystack string are very small.

---------------------------------------------------------------
SET client_min_messages TO notice;

\timing

DO $$
DECLARE
cnt integer := 0;
total integer := 0;
BEGIN
FOR i IN 1..10000 LOOP
select count(*) into cnt from pg_class where oid = 2662 and relname
like '%cl%';
total := total + cnt;
END LOOP;

RAISE NOTICE 'TOTAL: %', total;
END
$$
;
---------------------------------------------------------------

without patch: 74.499msec
with patch 77.321msec

In this case, the patched version will be a few percent slower, but I think
the overhead is small.

Regards,
Atsushi Ogawa

2022年1月12日(水) 5:17 Heikki Linnakangas <hlinnaka@iki.fi>:

Show quoted text

On 11/01/2022 15:55, Atsushi Ogawa wrote:

I have created a patch to enable the Boyer-More-Horspool search
algorithm (B-M-H) for LIKE queries.

Cool!

The conditions under which B-M-H can be used are as follows.

(1) B-M-H in LIKE search supports only single-byte character sets and

UTF8.

Multibyte character sets does not support, because it may contain another
characters in the byte sequence. For UTF-8, it works fine, because in
UTF-8 the byte sequence of one character cannot contain another

character.

You can make it work with any encoding, if you check for that case after
you find a match. See how text_position() does it.

(3) The pattern string should be at least 4 characters.
For example, '%AB%' can use B-M-H.

To be precise, the patch checks that the pattern string is at least 4
*bytes* long. A pattern like E'%\U0001F418%' would benefit too.

If I'm reading the code correctly, it doesn't account for escapes
correctly. It will use B-M-H for a pattern like '%\\%', even though
that's just searching for a single backslash and won't benefit from B-M-H.

(4) The first and last character of the pattern string should be '%'.

I wonder if we can do better than that. If you have a pattern like
'%foo%bar', its pretty obvious (to a human) that you can quickly check
if the string ends in 'bar', and then check if it also contains the
substring 'foo'. Is there some way to generalize that?

Looking at MatchText() in like.c, there is this piece of code:

else if (*p == '%')
{
char firstpat;

/*
* % processing is essentially a search for a text

position at

* which the remainder of the text matches the

remainder of the

* pattern, using a recursive call to check each

potential match.

*
* If there are wildcards immediately following

the %, we can skip

* over them first, using the idea that any

sequence of N _'s and

* one or more %'s is equivalent to N _'s and one

% (ie, it will

* match any sequence of at least N text

characters). In this way

* we will always run the recursive search loop

using a pattern

* fragment that begins with a literal

character-to-match, thereby

* not recursing more than we have to.
*/
NextByte(p, plen);

while (plen > 0)
{
if (*p == '%')
NextByte(p, plen);
else if (*p == '_')
{
/* If not enough text left to

match the pattern, ABORT */

if (tlen <= 0)
return LIKE_ABORT;
NextChar(t, tlen);
NextByte(p, plen);
}
else
break; /* Reached a

non-wildcard pattern char */

}

Could we use B-M-H to replace that piece of code?

How does the performance compare with regular expressions? Would it be
possible to use this for trivial regular expressions too? Or could we
speed up the regexp engine to take advantage of B-M-H, and use it for
LIKE? Or something like that?

I have measured the performance with the following query.

Setting up the B-M-H table adds some initialization overhead, so this
would be a loss for cases where the LIKE is executed only once, and/or
the haystack strings are very small. That's probably OK, the overhead is
probably small, and those cases are probably not performance-critical.
But would be nice to measure that too.

- Heikki